How learning evidence synthesis (systematic reviews etc) changed the way I search and some thoughts about semantic search as a complement search technique<p>I've spent a large part of my career as an academic librarian studying the question of discovery from many angles. Roughly speaking, they include</p><p></p><ul style="text-align: left;"><li><b>The "<a href="https://musingsaboutlibrarianship.blogspot.com/search/label/web%20scale%20discovery">web scale library discovery</a>" (now known as library discovery layer) angle</b> - This was from the 2010s to 2015, when I tried to implement, and spent long hours thinking about and studying the impact of, the now ubiquitous library discovery searches like Primo, Summon, EDS</li></ul><ul style="text-align: left;"><li><b>Citation indexes and bibliometrics angle - Web of Science, Scopus and <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/google%20scholar">Google Scholar</a> - </b>Thinking about web scale discovery tools naturally led me to think about how they differed from Google Scholar and cross-disciplinary citation indexes like Scopus and Web of Science. Here I studied <a href="https://musingsaboutlibrarianship.blogspot.com/search?q=bibliometrics">bibliometrics</a> and, more importantly, how citation indexes were constructed, and clarified my understanding of what citation indexes really were vs abstract and indexing databases and other types of search engines</li></ul><ul style="text-align: left;"><li><b>Open Access and Open Scholarly metadata angle - </b>Here I learnt a bit more about the raw materials of search indexes, from <a href="https://musingsaboutlibrarianship.blogspot.com/2022/04/5-things-you-may-not-know-about-dois-or.html">understanding more about DOI registration agencies like Crossref/Datacite</a> and <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/OA%20discovery">linking to Open Access papers and datasets</a> via Unpaywall etc, to the rise of <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/open%20citations">Open Scholarly metadata</a>, thanks to the efforts of grassroots groups like <a href="https://i4oc.org/">I4OC</a> and <a href="https://musingsaboutlibrarianship.blogspot.com/2020/07/why-openly-available-abstracts-are.html">I4OA</a>, and watched as the fruits of such open data led to the<a href="https://musingsaboutlibrarianship.blogspot.com/2020/11/the-next-generation-discovery-citation.html"> rise of large mega academic search engines like Dimensions, Lens.org, OpenAlex</a> etc as well as what I call <a href="https://musingsaboutlibrarianship.blogspot.com/2022/08/citation-based-literature-mapping-tools.html">citation based literature mapping services</a> like Connected Papers and ResearchRabbit.
I even did a minor detour studying <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/delivery">the "delivery" angle, on how to make authentication and authorization processes for content and services as seamless as possible</a></li></ul><ul style="text-align: left;"><li><b>Information retrieval and Large Language Models</b> - More recently, I started looking at the magic of <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/large%20language%20model">Large Language Models</a> and <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/retrieval%20augmented%20generation">Retrieval Augmented Generation based search</a>, and started reading more formally in traditional information retrieval, since this is the foundation of search. This itself now draws from two different subfields of Computer Science - NLP (e.g. BERT and transformers) and traditional information retrieval focused on systems (think TF-IDF, BM25, the Text REtrieval Conference (TREC) etc), and this <a href="https://arxiv.org/pdf/2010.06467.pdf">survey provides a good introduction of the latter to practitioners of the former.</a></li></ul><div>There are of course many other areas that I wish I had more knowledge of, such as looking at it through the lens of linked data/knowledge graphs, metadata schemas etc, but throughout it all, I was long aware of a body of knowledge possessed by a special group of librarians who were masters at evidence synthesis. I knew they went way beyond the basic nested boolean search strategies that we taught in our freshman information literacy classes, and all I knew was that they were truly masters of search.</div><div><br /></div><div>As ChatGPT put it in a whimsical way:</div><p></p><div><blockquote><blockquote>In a realm where boolean search strategies were mere stepping stones for freshmen wandering the vast libraries of information literacy, there existed a clandestine order of librarians. These weren't your garden-variety book custodians, oh no! They were the fabled evidence synthesis wizards, cloaked in the mastery of MeSH and fluent in the ancient dialects of the Evidence-Based Medicine Pyramids. With a flick of their wrist, they navigated the treacherous terrains of RCT and research designs with unparalleled grace. They were the unsung heroes in the quest for knowledge, embarking on epic journeys to vanquish the dragons of bias and conjure the magical elixirs of meta-analysis. Through their sage understanding of reproducibility, they wove together strands of disparate data, creating tapestries of insight that illuminated the darkest corners of inquiry. In the hallowed halls of academia, they were the keepers of truth, guardians of the grail of evidence-based practice, and the bridge between the realms of known and unknown.</blockquote></blockquote><p>I was so in awe of them that 10 years ago, when a PhD student in the area of Public Policy approached me for help with a systematic review, I threw up my hands and directed them to the medical librarians (at my former place of work).</p><p>But still, I couldn't help but notice the stuff they studied and produced was SO useful.
For example, I was blown away by <a href="https://pubmed.ncbi.nlm.nih.gov/?term=Gusenbauer+M&cauthor_id=31614060">Neal Haddaway and Michael Gusenbauer's amazing work analysing the tiniest details of the capabilities of a wide variety of academic search engines</a> (see also <a href="https://www.searchsmart.org/?~()">SearchSmart, a free tool I sometimes use to check the capabilities of academic search tools</a>) and <a href="https://www.eshackathon.org/">the interesting work released at the yearly evidence synthesis hackathons</a>. </p><p>Even before that, I was really impressed by <a href="https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=can+google+scholar+be+used+for+systematic+review&btnG=&inst=14102473421921925766">the rigor of the work they produced around the possibility of using Google Scholar for systematic reviews</a>. Or by how evidence-based the guidance in standards like the<a href="https://training.cochrane.org/handbook/current/chapter-04"> Cochrane Handbook for Systematic Reviews of Interventions</a> was (I loved reading papers where they try to determine the value-add of various supplementary methods such as citation searching).</p><p>But in 2020, with COVID in full swing, I finally decided to buckle down and try to learn this magical art of evidence synthesis formally. I was partly inspired by all the debates around meta-analysis of COVID related issues (e.g. meta-analyses giving totally different estimates of the fatality rate of COVID-19), and partly because, given my interest in all things discovery, I thought it was high time I learnt evidence synthesis formally.</p><h2 style="text-align: left;"><br />1. Learning evidence synthesis, at least the principles of it, wasn't really that tough for me</h2><div>The main barrier to me learning evidence synthesis was that I really wasn't that interested in clinical medical type subjects, but when<a href="https://www.campbellcollaboration.org/blog/first-systematic-review-online-course-for-social-science.html"> the Campbell Collaboration launched a systematic review and meta-analysis online course focused on social science disciplines in June 2023</a>, I immediately took the opportunity to take the free course!</div><div><br /></div><div>To be honest, while doing the course, I found most of it pretty familiar.</div><div><br /></div><div>Part of it was my general familiarity with the technical aspects of databases and searching. For example, I was long aware that <a href="https://twitter.com/nealhaddaway/status/1093810798793814016">Web of Science is not a database but a platform</a>, and I even knew that specifying <a href="https://twitter.com/Martin_A_Nunez/status/1386648795535052800">Web of Science Core Collection is technically not sufficiently reproducible because different institutions may have different holdings/year coverage!</a></div><div><br /></div><div>Also, my time studying open science issues and reading about reproducibility had made me sufficiently familiar with the idea of evidence synthesis as a reproducible search.
The ideas that one should look for grey literature and not filter to only "top" journals because of the <a href="https://en.wikipedia.org/wiki/Publication_bias#:~:text=Publication%20bias%20is%20sometimes%20called,a%20bias%20in%20published%20research.">file drawer effect</a> (and that one should do critical appraisal for quality assessment) were not new to me.</div><div><br /></div><div>I was also included in some past efforts to see how reproducible Google Scholar results were across different locations...</div><div><br /></div><div>As expected, the parts I found most difficult involved statistics and critical appraisal, as I was relatively weak at understanding research design. </div><div><br /></div><div>See some of my thoughts in tweets <a href="https://twitter.com/aarontay/status/1663870085130510336">here</a> and <a href="https://twitter.com/aarontay/status/1666414807074086915">here</a>.</div><div><br /></div><div>I still don't think I am really an evidence synthesis librarian (I don't get enough real practice) but I know enough to fake being one :) </div><p><br /></p><h2 style="text-align: left;">2. Even though I don't need to do full-fledged evidence synthesis searching often, it has changed the way I search</h2><p>Since formally taking a course on evidence synthesis, I have had the opportunity to formally do evidence synthesis only once (and it was actually an update of an existing systematic review, where I helped translate existing search strategies to the platforms we had access to). </p><p>In general, my institution just doesn't do a lot of formal systematic reviews or meta-analyses, though there is quite a lot of demand for help with doing literature reviews.</p><p><i>Still, I find that learning how to do evidence synthesis has slowly changed the way I think about searching.</i></p><p>IMHO, the fundamental difference between librarians trained to do evidence synthesis and most searchers, including even librarians (particularly people who come from information literacy), is that they have a viewpoint that aims at high sensitivity or high recall searches, while the average searcher or even librarian tends to optimise for higher precision.</p><p></p><blockquote>Mind you, that itself is not wrong if you are aware of what you are doing and that is what you are going for. But I find that often people only have that one mode of searching and aren't aware they could do it differently.</blockquote><p></p><p>For example, in the past when asked to construct a search query, I would try a few keywords, maybe combine them together in a nested boolean fashion, and skim through the first few results to see if the results were mostly relevant.</p><p>While I am not saying precision in searches isn't important to save time, this was still fundamentally a viewpoint that worried about saving time and/or getting a "good enough search", as opposed to one that optimised for, or at least equally weighted, not missing relevant results where possible.</p><p>More recently, I started to do a small but simple thing. I would ask for (or find myself) some target papers that were definitely relevant and keep them aside. Independently of this, I would craft a search strategy and try it out.</p><p>After doing this a few times,<b><i> I was amazed at how easy it was to make a silly mistake creating search strategies. I would inevitably come up with search strategies that missed out target papers.</i></b></p>
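<p>To make this concrete, the check boils down to a few lines of code. Below is a minimal sketch in Python; the DOIs are purely illustrative placeholders, and in practice the retrieved list would come from exporting your search results (e.g. from Scopus):</p><pre>
# Minimal sketch of checking a search strategy against held-out target papers.
# All DOIs below are hypothetical placeholders.

def normalise(doi):
    # DOIs are case-insensitive; strip any resolver prefix for comparison.
    return doi.strip().lower().removeprefix("https://doi.org/")

def check_targets(retrieved_dois, target_dois):
    retrieved = {normalise(d) for d in retrieved_dois}
    targets = {normalise(d) for d in target_dois}
    missed = targets - retrieved
    recall = (len(targets) - len(missed)) / len(targets)
    return recall, missed

retrieved = ["10.1000/demo.001", "10.1000/demo.002", "10.1000/demo.003"]
targets = ["10.1000/demo.002", "10.1000/demo.999"]  # known relevant papers

recall, missed = check_targets(retrieved, targets)
print(f"Recall against target papers: {recall:.0%}")
print("Missed targets to investigate:", missed)
</pre><p>The papers that land in <i>missed</i> are exactly the ones worth investigating: each one usually reveals a missing synonym, a forgotten wildcard, or a whole concept block that should not have been mandatory.</p>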
<p>So for example, I would search in Scopus with my first search strategy and realize I had made a silly error, forgetting to use a wildcard or forgetting an obvious synonym, and missed a relevant paper (and yes, the missed paper was indexed in Scopus). Other times, I would add keywords or whole concepts that I thought were definitely needed, and it turned out the relevant paper did not have them at all.</p><p>Worse yet, <b><i>I noticed that taking the word of a domain expert (someone familiar with the literature to some degree) didn't necessarily mean you were safe.</i></b></p><p>For example, I was working with a domain expert in a certain area relating to environmental science, and the person insisted that all the papers he needed were of the following form</p><p>(A OR B OR C) AND (D OR E) AND (F OR G)</p><p>I was a bit dubious about whether F OR G was necessary, but I was assured it was. But when we did the search live, we noticed one of the target papers was missing. It took a while to figure out why, but in the end, we realized the paper lacked the keyword F but instead had H!</p><p>I have done this enough times now to realize this isn't unusual. </p><p>Notice how if all you did was look at your search results (without having a separate target paper to compare against), you would by definition never notice that your search wasn't the best.</p><p>That said, I do understand that most of the researchers I help with literature review searches, even though they claim not to want to miss relevant papers, definitely have a lower threshold for how much effort they want to spend looking through or screening papers than most researchers doing actual evidence synthesis.</p><p><i>That is why I tend to ask the researcher I am helping how many papers they are willing to screen through, to help gauge how broad a search strategy to use and make decisions on trade-offs.</i></p><p>As a librarian helping with the search strategy, I realize I have to straddle the challenge of trying to increase recall at the risk of the researcher deciding not to use the search strategy I suggest because it returns too many results, and at the other end filtering too much to bring the number of results down to a reasonable level and losing relevant papers.</p><p></p><blockquote>In my context, a lot of the researchers I help insist we narrow down to a limited set of journals, which is of course not kosher by evidence synthesis principles (bias from only looking at top impact journals!). But this also means I can afford to do VERY broad searches (since limiting by journals will cut down on a lot) and the cost of missing even one relevant paper is far worse than usual</blockquote><p></p><p>This is also why<i> live searches and experimenting are very important, rather than discussing search keywords on paper abstractly</i>.
Even domain experts often do not have a sense of how keywords they propose will affect the number of results retrieved.</p><p></p><blockquote>BTW, thinking of sources to search is another area where evidence synthesis thinking leads you to do things differently, but that is worth another post</blockquote><p></p><p><br /></p><h2 style="text-align: left;">How do you know the search strategy is effective?</h2><p>A few colleagues also joined me in doing online courses on evidence synthesis, and one comment I heard the most was that they were surprised these courses mostly focused on theory and methods but did not actually make one practise constructing keyword search strategies.</p><p>Sure, things like the <a href="https://www.sciencedirect.com/science/article/pii/S0895435616000585">PRESS Peer Review of Electronic Search Strategies: 2015 Guideline Statement</a> exist, but even following the checklist feels more like an art than a science.</p><p>There doesn't seem to be a way to actually learn to get good at constructing search strategies beyond lots of practice.</p><p>But here we run into a problem: to actually learn from practice, we need actual feedback. But how do we know if we did well or badly? This is why I suspect many searchers think they are way better searchers than they actually are, particularly if they don't go out of their way to test their searches.</p><p></p><blockquote><i>Again, imagine a scenario where you are asked to help a research team come up with keywords. You do your best and use some common sense keywords, chain them in a nested boolean fashion, and skim the first n results and they look okay. Worse, you hand it off to a research team and they never come back to you to iterate the search, or maybe they actually just use exactly what you found, screen down and cite what was found. How would you ever know your search didn't miss out obvious papers?</i></blockquote><p></p><p>As I said, the simple act of trying to see if my search strategy found relevant target papers greatly shook my confidence that my search strategies were good. This made me paranoid. How good were our searches anyway, even after I corrected for the obvious mistakes?</p><p></p><blockquote>I know of course that in reality, properly done evidence synthesis searches do more than just boolean searches across multiple sources; they also consult experts, do hand searching, do citation chasing etc. Here, I am just concerned with the search strategies.</blockquote><p></p><p>One of the things I recently did was to actually look at papers that resulted from the literature search support we provided.</p><p>So for example, we were recently acknowledged in a review paper which covered an extremely broad topic. I extracted the references and compared them against the candidate papers we gave. We had in fact done the following</p><p></p><ol style="text-align: left;"><li>Did a Scopus search with a fairly complicated nested boolean search - roughly 3,700 results</li><li>Was provided with a set of 40 known relevant papers - using <a href="https://www.eshackathon.org/software/citationchaser.html">Citationchaser</a> - extracted citations (1,388) and references (950)</li></ol><div><br /></div><div>I am not sure if this is a fair test, but I was wondering: of the 83 references that were finally cited, how many were in the unique set of (1) + (2)?</div><div><br /></div><div>For the purposes of this blog post, the actual results don't actually matter.
But essentially I found that of the 83 references cited, only 19 were in (1) + (2). Is this good, bad or expected?</div><div><br /></div><div>To further complicate matters, of the 83, 33 were from the 40 known relevant papers, so realistically we could only hope to find the remaining 83 - 33 = 50 papers. So if (1) + (2) found 19, that is a rate of 19/50 or 38%.</div><blockquote><div>I am also wondering if the researcher we gave the results to actually screened the candidate papers we gave her and found even more papers on top of those, or if she ignored them, did the search her own way, and after iterating the search many times found the remaining papers. </div></blockquote><div></div><p></p><p><i>I am still thinking about whether this is a fair way of judging the quality of a search strategy. This is because it may not be realistic to expect keyword search alone to find most of the relevant papers; in reality you need a combination of different techniques.</i></p><h2 style="text-align: left;"><br /></h2><h2 style="text-align: left;">3. Semantic Search as a complement</h2><p>One of the ways in which new academic search engines like <a href="http://elicit.com">Elicit</a> and <a href="http://typeset.io">SciSpace</a> are changing things lies not just in how but also in what is retrieved in the search results. (Elicit also generates a paragraph of text as a direct answer with citations, but this isn't what I mean here.)</p><p>By combining results from standard lexical search like BM25 (an improved version of TF-IDF) with results from semantic search (typically a dense embedding type system) and doing a reranking (probably with computationally expensive rerankers), academic search engines like Elicit have the possibility of overcoming the shortcomings of straight-out lexical keyword searching and surfacing relevant documents which may not contain the exact keyword used. (See <a href="https://medium.com/p/95eb503b48f5">this discussion of Boolean vs Lexical search vs Semantic Search/dense embeddings</a>.)</p><p></p><blockquote>It might be that some search systems use only BM25, a lexical search, for first-stage retrieval, followed by a cross-encoder reranker using BERT or similar models. On paper, this might not bring the same benefits as having a semantic/embedding type search as the first-stage retriever, since the initial BM25 might already miss relevant documents if the wrong query was used, but in practice it is still very good and hard to beat as a baseline.</blockquote><p></p><p>One thing I have been trying is running searches in Elicit and SciSpace, looking at the top 50 results for relevant papers, and using these as target relevant papers.
You could of course do this with a typical lexical search based system only, and it will still give you some clues on what keywords to use, but clearly this will be a lot more effective with search engines that also use semantic search/embedding type technology, as these might surface papers that don't contain the query terms at all.</p><p>One of the tricks about using systems like Elicit is that they recommend you type what you want in full natural language and not keyword search (where you drop the stop words) for better results.</p><p>I probably will cover this in another blog post, but the literature I am looking at suggests (based on experiments) that transformer based systems (typically BERT or derivatives) do better ranking and reranking if the query terms are in natural language as opposed to keywords.</p><p>So for example do </p><blockquote><p>Is there an open access citation advantage?</p></blockquote><p>as opposed to</p><p></p><blockquote>open access citation advantage?</blockquote><p></p><p>This is what we would expect purely from theory, because unlike standard bag-of-words/lexical search methods, BERT type models are able to take into account the order of words as well as words that would typically be considered stop words.</p><p>Below is a simplistic example of how embeddings "understand" words in the context of a sentence rather than just as individual words.</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY9jHwqftiSaa9EGFLpyfggWXHi_dHsa849uzyMNBycPSTSjEO1RkhvDfVPE2ybJz6hYObn4xQs6e605N6dA255GvyVkkL7B7BdoUulMr1LHxm3Nkrw0ehyG-I0C5LMTlPCos6bAf3tt8m0BeNkYC5n9VsgG18_vBmRUkdJBWUW4Zwr94uOEPpkPpUqKWH/s706/embeddinggood2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="400" data-original-width="706" height="362" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgY9jHwqftiSaa9EGFLpyfggWXHi_dHsa849uzyMNBycPSTSjEO1RkhvDfVPE2ybJz6hYObn4xQs6e605N6dA255GvyVkkL7B7BdoUulMr1LHxm3Nkrw0ehyG-I0C5LMTlPCos6bAf3tt8m0BeNkYC5n9VsgG18_vBmRUkdJBWUW4Zwr94uOEPpkPpUqKWH/w640-h362/embeddinggood2.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><p>The above shows that when you embed the four sentences with <a href="https://huggingface.co/tomaarsen/mpnet-base-nli-matryoshka">a state-of-the-art sentence embedding model</a> and compute cosine similarity, the system knows that</p><p>"It is not raining outside!" is most similar to "It is sunny" (score=0.7589), as opposed to "It is raining slightly" (score=0.6758) and "It is raining heavily" (score=0.5237). </p><p>It seems to "understand" that the "not" in that sentence modifies "raining"...</p><blockquote><p><i>Interestingly, when I use a smaller embedding with only 64 dimensions instead of 1024, the scores are 0.8428, 0.7141, 0.5489, keeping the same ordinal ranking, so one can probably save storage space by using just 64 dimensions! This is the whole point of <a href="https://huggingface.co/blog/matryoshka">Matryoshka Embedding Models</a>, or Russian-doll embedding models, where the first few numbers in the embedding are the most important, so you can use smaller embeddings with roughly the same result.</i></p></blockquote>
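<p>For the technically curious, the comparison above can be reproduced in a few lines of Python with the sentence-transformers library and the model linked above. This is just a sketch, and exact scores will vary with the model version:</p><pre>
# Sketch: cosine similarity between sentence embeddings
# (assumes: pip install sentence-transformers).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("tomaarsen/mpnet-base-nli-matryoshka")

sentences = [
    "It is not raining outside!",  # the "query" sentence
    "It is sunny",
    "It is raining slightly",
    "It is raining heavily",
]
emb = model.encode(sentences)

# Compare the first sentence against the other three.
print(util.cos_sim(emb[0], emb[1:]))

# Matryoshka trick: keep only the first 64 dimensions; the
# ordinal ranking of the scores should stay roughly the same.
emb64 = emb[:, :64]
print(util.cos_sim(emb64[0], emb64[1:]))
</pre>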
<h2 style="text-align: left;">As a technical aside</h2><p>I've been reading the literature on information retrieval, and current state-of-the-art efforts involve trying to use the superior NLP capabilities of cross-encoder models to "teach" simpler sparse embedding models to better represent queries or documents. The hope here is to automatically train sparse embedding models that are as accurate as dense embedding models but less computationally expensive to run, and more interpretable to boot.</p><p>For example, representing weighting using TF-IDF clearly isn't the best way to capture semantic meaning, but with the right "teaching" by BERT models, the weights could take into account synonyms via query or document expansion etc. </p><p>See for example, <a href="https://github.com/AdeDZY/DeepCT">DeepCT and HDCT: Context-Aware Term Importance Estimation For First Stage Retrieval</a></p><blockquote>Term frequency is a common method for identifying the importance of a term in a query or document. But it is a weak signal. This work proposes a Deep Contextualized Term Weighting framework that learns to map BERT's contextualized text representations to context-aware term weights for sentences and passages.</blockquote><p></p><p>See also <a href="https://arxiv.org/abs/2104.12016">DeepImpact</a>, <a href="https://arxiv.org/abs/2104.07186">COIL</a> etc.</p><p>For example, <a href="https://europe.naverlabs.com/blog/splade-a-sparse-bi-encoder-bert-based-model-achieves-effective-and-efficient-first-stage-ranking/">SPLADE is a neural retrieval model which learns query/document sparse expansion via the BERT MLM head and sparse regularization.</a></p><p>It seems to me the same idea could be transferred to using such models (typically BERT type models) to help human systematic review librarians adjust their keywords.</p>
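<p>As a toy illustration of the fill-mask idea these models build on, here is a sketch using the Hugging Face transformers library to get a BERT masked-language-model head to propose substitutes for a masked concept in a query. The query is a made-up example, and this is not SPLADE itself, just the underlying mechanism:</p><pre>
# Sketch: mining BERT's MLM head for candidate search terms
# (assumes: pip install transformers).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# A hypothetical query with one concept masked out; the model proposes
# plausible substitutes a searcher could mine for synonyms.
query = "working as a [MASK] is associated with a higher risk of ovarian cancer."
for suggestion in unmasker(query, top_k=5):
    print(f"{suggestion['token_str']:15s} {suggestion['score']:.3f}")
</pre><p>A librarian would still need to vet the suggestions, of course, but it gives a sense of how contextualised models could feed into keyword adjustment.</p>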
<h2 style="text-align: left;"> Conclusion</h2><p>Experienced/real evidence synthesis librarians looking at this piece will probably think what I am saying here is obvious, but to me, a pretty experienced librarian in the area of discovery, a lot of what I wrote above are fairly recent insights, thanks to thinking a bit more like an evidence synthesis librarian.</p></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Things I am still wondering about generative AI + Search in 2024 - impact of semantic search, generation of answers with citations and more..<p><i>Earlier related pieces - <a href="https://musingsaboutlibrarianship.blogspot.com/2023/03/how-q-systems-based-on-large-language.html">How Q&A systems based on large language models (eg GPT4) will change things if they become the dominant search paradigm - 9 implications for libraries</a></i></p><p>In the ever-evolving landscape of information retrieval and library science, the emergence of large language models, particularly those based on the transformer architecture like GPT-4, has opened up a Pandora's box of possibilities and challenges. </p><p>As someone who first started <a href="https://musingsaboutlibrarianship.blogspot.com/2020/07/why-gpt-3-might-be-greatest-disruption.html">playing around with GPT3 in 2020</a>, I focused even more on large language models in 2023, especially on the concept of Retrieval Augmented Generation (RAG).</p><p>My journey has been a blend of academic rigor and tech enthusiasm: combing through research papers, absorbing insights from YouTube videos and Substacks (<a href="https://cameronrwolfe.me/">Cameron R. Wolfe's pieces are particularly enlightening</a>, such as this one on <a href="https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search">AI powered/Vector search</a>, though it requires some technical understanding), and exploring the blog posts of companies like <a href="https://blog.vespa.ai/">Vespa.ai</a>, <a href="https://www.pinecone.io/learn/">Pinecone</a>, <a href="https://txt.cohere.com/">Cohere</a>, <a href="https://huggingface.co/learn/nlp-course/chapter1/1">Hugging Face</a>, <a href="https://blog.llamaindex.ai/">LlamaIndex</a> and <a href="https://blog.langchain.dev/">Langchain</a> that provide frameworks or search infrastructure for running RAG.</p><p>Still, there are many things I wonder about, and most of them are not related to technical issues.</p><p> </p>
<h3 style="text-align: left;">1. Will academic search engines producing direct answers with citations be the norm? What are the implications? Will it slow down the open abstract and open scholarly metadata movement?</h3><p>As I look at all the new "<a href="https://musingsaboutlibrarianship.blogspot.com/p/list-of-academic-search-engines-that.html">AI powered academic search" (see my list)</a> features, whether those added into existing academic databases like Scopus, Dimensions.ai and Primo or from brand new search engines like Elicit.com, SciSpace etc, the most common feature by far is the system generating a direct answer to the query, backed up with citations from multiple documents.</p><p><i></i></p><blockquote><i>This is not a technique particular to academic search; everything from Bing Chat (now Copilot) and ChatGPT+ (including the older plugins and Custom GPTs) to <a href="https://musingsaboutlibrarianship.blogspot.com/2023/12/googles-search-generative-experience.html">Google's Search Generative Experience</a> uses the exact same technique.</i></blockquote><p></p><p>This, as I have blogged many times already, is <a href="https://musingsaboutlibrarianship.blogspot.com/search/label/retrieval%20augmented%20generation">built on the paradigm of RAG</a> and typically involves a two-stage process (a minimal code sketch follows below) of</p><p></p><ul style="text-align: left;"><li>Retrieving documents, or more usually chunks of text (context), that might be "relevant" to answering the question (typically relevant chunks are identified using vector or embedding based similarity search)</li><li>Feeding the chunks of text/context to a Generative AI model like GPT3.5 or GPT4 with a prompt to answer the query using the context found</li></ul><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNIwEQf0W79yZO1Nr5V3TLH9zns17OCA1Z3o7owEG04UPn5M6wqgglg5G9k79qUpS4GsqOOxVQvsYcvzhgXYHJdvqFCVqsW0GY6KFx0YboSxih3hc77xVYZWYNVrrSp03jAnevxHejlAn-YvUFQSVllmPsstM4KLi641o6x7jBRU9JgPGEyaFfzJzrKDS5/s1139/ragexample.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="750" data-original-width="1139" height="422" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiNIwEQf0W79yZO1Nr5V3TLH9zns17OCA1Z3o7owEG04UPn5M6wqgglg5G9k79qUpS4GsqOOxVQvsYcvzhgXYHJdvqFCVqsW0GY6KFx0YboSxih3hc77xVYZWYNVrrSp03jAnevxHejlAn-YvUFQSVllmPsstM4KLi641o6x7jBRU9JgPGEyaFfzJzrKDS5/w640-h422/ragexample.png" width="640" /></a></div><br /><div style="text-align: center;"><i>https://www.pinecone.io/learn/retrieval-augmented-generation/</i></div><div style="text-align: center;"><br /></div><div><br /></div><div><i>From the information literacy point of view</i> - I wonder if most users will automatically trust the citations and not verify them.</div><div><br /></div><div><blockquote>Some tools help by highlighting the contexts that were used to generate the answer, which makes it easier to verify, but many do not. </blockquote></div><div><br /></div><div>It is important to note that unlike generated answers and citations from asking a pure large language model alone, answers generated using RAG will almost always cite real papers that were found by the search.</div><div><br /></div><div>However, there is no guarantee that the <i>generated statement and the accompanying citation will match</i>; in the technical literature this is sometimes called citation faithfulness.</div>
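<p>Here is the minimal sketch of that two-stage retrieve-then-generate loop promised above. The retriever model, the OpenAI model name and the prompt wording are all illustrative assumptions on my part, not how any particular product works; note the explicit instruction to refuse when the context does not answer the question, an issue that comes up again below:</p><pre>
# Minimal RAG sketch: retrieve top-k chunks by embedding similarity,
# then ask an LLM to answer using only those chunks.
# Assumes: pip install sentence-transformers openai (model names illustrative).
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

retriever = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

corpus = [  # stand-ins for abstracts or text chunks from a real index
    "Paper A reports an open access citation advantage in biology.",
    "Paper B finds no citation advantage after controlling for selection.",
    "Paper C is about library discovery layers.",
]
corpus_emb = retriever.encode(corpus)

def answer(question, k=2):
    # Stage 1: retrieve the k most similar chunks as the "context".
    scores = util.cos_sim(retriever.encode(question), corpus_emb)[0]
    top = scores.argsort(descending=True)[:k].tolist()
    context = "\n".join(f"[{i + 1}] {corpus[i]}" for i in top)

    # Stage 2: generate an answer grounded in the context, with an
    # explicit negative-rejection instruction.
    prompt = (
        "Answer the question using ONLY the numbered context, citing "
        "sources like [1]. If the context does not answer the question, "
        f"say you cannot answer.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(answer("Is there an open access citation advantage?"))
</pre><p>Whether the generated sentences actually follow from the cited chunks is exactly the citation faithfulness question, as the examples below show.</p>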
<div><br /></div><div>Below is an example from scite assistant, where the generated statements and citations do not match.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEid6WCTFx-Yg_AeC1KccDHwUJjNT-a2UlqWYDIUTaFQpE1ZXWjoO8MJp0i4FJBUIpKMUMtPGodoaxoLOkPf9qcGdaIqiH3CIwTHaDSe0LFk6zRzyWy2EKtZgF48ZUvt7-C1AoYiBraofpeIendWnJ-m4s4lQdf8Dy2bsZqgIhSJVWjmtKsdBwWK3yNLOgOk/s863/scite.ai-unfaithful.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="546" data-original-width="863" height="404" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEid6WCTFx-Yg_AeC1KccDHwUJjNT-a2UlqWYDIUTaFQpE1ZXWjoO8MJp0i4FJBUIpKMUMtPGodoaxoLOkPf9qcGdaIqiH3CIwTHaDSe0LFk6zRzyWy2EKtZgF48ZUvt7-C1AoYiBraofpeIendWnJ-m4s4lQdf8Dy2bsZqgIhSJVWjmtKsdBwWK3yNLOgOk/w640-h404/scite.ai-unfaithful.png" width="640" /></a></div><div style="text-align: center;"><i>scite assistant generated statement and citations do not match</i></div><div style="text-align: center;"><br /></div><div><br /></div><div>I've noticed this problem gets particularly bad when there is no context that can answer the question (which occurs more often if you restrict what the system can cite, as you can in scite.ai assistant). This tends to force the system into a corner where, instead of refusing to answer, it tries to force-fit a citation into the generated answer, leading to poor citation faithfulness.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvit8xzArGJKjxJoMK-abqnTY4YUaTXh5KDnGBqJNSDP2GPTxaqHX-M3sxIQmIhDwjtMMZa4_FZmEzi1YJu18hIi4j2ipHpwP-eNlIToAyxYqIHO07FRlYX77IhLdE285icedZToptJ2DskFqb0T-kOO5nrdlz4ufxCGTx8TFGCvK7oVCPCUwJcb7CaDVF/s883/scite-tight.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="883" data-original-width="697" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjvit8xzArGJKjxJoMK-abqnTY4YUaTXh5KDnGBqJNSDP2GPTxaqHX-M3sxIQmIhDwjtMMZa4_FZmEzi1YJu18hIi4j2ipHpwP-eNlIToAyxYqIHO07FRlYX77IhLdE285icedZToptJ2DskFqb0T-kOO5nrdlz4ufxCGTx8TFGCvK7oVCPCUwJcb7CaDVF/w506-h640/scite-tight.png" width="506" /></a></div><div style="text-align: center;"><i>with scite assistant you can force citations to come from a very limited set of papers; the above forces it to cite only letters in Chemistry from 2022!</i></div><div style="text-align: center;"><br /></div><div><br /></div><div>This has been verified in many studies. For example, in a <a href="https://arxiv.org/pdf/2309.01431.pdf">recent study</a>, the researchers manipulated the context sent to the generator so that none of the documents answered the question. The generator was prompted to reject the question if the context it found did not answer it.
This is called the negative rejection test, and hopefully these Q&A systems will refuse to answer.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy4E9W5ZPlX_-TMYLZFKHpBBXtGzX0VJmHV6bBoVELT6Zg0Y1g1fMxn_lcd0Z0SSKjd3mZ-G4fk2qm29bHEHBpG8JbmgIRby6WJ6kSqSqaqHX7b7aPSE8U6BS_dkeRXo84kj9hJj5ttdeNKWqRExOXWs3UxHA5ZFTfw6OEaroXjgBxSMrGENf06s-x3FYQ/s393/ragfail.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="393" data-original-width="374" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhy4E9W5ZPlX_-TMYLZFKHpBBXtGzX0VJmHV6bBoVELT6Zg0Y1g1fMxn_lcd0Z0SSKjd3mZ-G4fk2qm29bHEHBpG8JbmgIRby6WJ6kSqSqaqHX7b7aPSE8U6BS_dkeRXo84kj9hJj5ttdeNKWqRExOXWs3UxHA5ZFTfw6OEaroXjgBxSMrGENf06s-x3FYQ/s320/ragfail.png" width="305" /></a></div><br /><div><br /></div><div>For example, here is scite.ai assistant refusing to answer when forced to try to answer a query with no good context/documents found.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8489CQcXurwIfOctKwhyphenhyphenlGOytcKoluMYiHhYkjabxflDI4f-nqdKOYESg6mCoviK05Brg4UFD-xh1-PJ3f98pfXb6paGa8d5M1OI-GrPkA5pl8Idow_9MHG7uhEKBZULrQGzQ0JD-ByyMyasITVEu-W2qddf2m63YaSBPu2_ArqhnuY30J13ZNxaYnVsI/s942/scite.ai-refusal.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="217" data-original-width="942" height="148" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi8489CQcXurwIfOctKwhyphenhyphenlGOytcKoluMYiHhYkjabxflDI4f-nqdKOYESg6mCoviK05Brg4UFD-xh1-PJ3f98pfXb6paGa8d5M1OI-GrPkA5pl8Idow_9MHG7uhEKBZULrQGzQ0JD-ByyMyasITVEu-W2qddf2m63YaSBPu2_ArqhnuY30J13ZNxaYnVsI/w640-h148/scite.ai-refusal.png" width="640" /></a></div><br /><div>Unfortunately, this isn't that common.</div><div><br /></div><div>In the above study, despite being instructed not to answer if no context answered the question, the generator still tried to answer most of the time and failed to reject the question. Using ChatGPT as the generator model, it rejected or refused to answer only 25% of the time!</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_MY7mna2drj4FUp4q-3nXl4nL9MeZp4zNX57cib-Lt39Q-vq-5aLZ4S-v70WvFndDrXZEkjmrWXIU5ya7VDmS9ekhxOBMY_cM7cVb9oI6hywvf6_yn_8S4rGXZBr2xcHyHSlto71fzmJtvAybIGMyqMgWfJUd4U_KlYW-MIfbpzd02aTauRlDrYtrmVBa/s877/ragfail2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="488" data-original-width="877" height="356" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh_MY7mna2drj4FUp4q-3nXl4nL9MeZp4zNX57cib-Lt39Q-vq-5aLZ4S-v70WvFndDrXZEkjmrWXIU5ya7VDmS9ekhxOBMY_cM7cVb9oI6hywvf6_yn_8S4rGXZBr2xcHyHSlto71fzmJtvAybIGMyqMgWfJUd4U_KlYW-MIfbpzd02aTauRlDrYtrmVBa/w640-h356/ragfail2.png" width="640" /></a></div><br /><div>Though the study above was searching through news stories, this applies to academic search too. </div><div><br /></div><div>A related problem arises when, "backed into a corner" with no good context that answers the question, a RAG system still "decides" to generate statements that have faithful citations; in that case the generated statements do not seem to answer the question, resulting in poor "answer relevancy".
This is a less serious issue, since humans can clearly see this happening.</div><div><br /></div><div>All this is <a href="https://musingsaboutlibrarianship.blogspot.com/2023/07/recorded-extended-talk-possible-impact.html">covered in my talk here</a>, which also points to another worrying empirical finding: people tend to rate generated answers that have low citation accuracy (whether recall or precision) as having high fluency and perceived usefulness. </div><div><br /></div><div>This makes a lot of sense when you think about it, but is still extremely worrying.</div><div><br /></div><div>In fact, a lot of research has emerged since 2022 that pretty much shows the numerous reasons RAG can fail (e.g. difficulty combining information from multiple contexts, reranking issues) and a <a href="https://arxiv.org/abs/2312.10997">bewildering series of proposed techniques at all parts of the pipeline to try to mitigate this, from both academic</a> and industry sources. </div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6R14JovZiLjwiVo5wN74YUZao7SVln0qnkHktoy0wDnynGh3DH8hoUGncuYW56enOlNalVhLoRYHKP2MiKT1Z6m1kYZO1v670YrO1_4IcDY-6wDnGTO5Fz-WhoO2_PaMiTSQNNFamxpSN_lBNc-9YwoDvlE5f19Nf5GOHPsIDKE9QS2ZRrtRon79MqYoe/s1501/ragfail3.jpg" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="746" data-original-width="1501" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh6R14JovZiLjwiVo5wN74YUZao7SVln0qnkHktoy0wDnynGh3DH8hoUGncuYW56enOlNalVhLoRYHKP2MiKT1Z6m1kYZO1v670YrO1_4IcDY-6wDnGTO5Fz-WhoO2_PaMiTSQNNFamxpSN_lBNc-9YwoDvlE5f19Nf5GOHPsIDKE9QS2ZRrtRon79MqYoe/w640-h318/ragfail3.jpg" width="640" /></a><a href="https://arxiv.org/pdf/2401.05856.pdf" style="text-align: left;">Seven Failure Points When Engineering a Retrieval Augmented Generation System</a></div><div><br /></div><div><br /></div><div>Want to do some technical reading on the issues? Here are some:</div><div><ul style="text-align: left;"><li><a href="https://arxiv.org/pdf/2309.01431.pdf">Benchmarking Large Language Models in Retrieval-Augmented Generation</a></li><li><a href="https://arxiv.org/pdf/2401.05856.pdf">Seven Failure Points When Engineering a Retrieval Augmented Generation System</a></li><li><a href="https://arxiv.org/pdf/2307.16877.pdf">Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering</a></li><li><a href="https://arxiv.org/abs/2304.09848v2">Evaluating Verifiability in Generative Search Engines</a></li></ul></div><div><br /></div><div><i>From the point of view of a supporter of Open Scholarly metadata and Open Access,</i> one also needs to realize that the value of Open Access full text has increased. This might lead publishers to push up the price of making publications Open Access.</div><p>One commenter <a href="https://upstream.force11.org/large-language-publishing/">even suggests</a></p><blockquote>The companies could also pull back from OA altogether, to keep a larger share of exclusive content to mine.</blockquote><p>This is the flip side of naive proposals like this librarian who wonders: <a href="https://medium.com/a-academic-librarians-thoughts-on-open-access/are-we-undervaluing-open-access-by-not-correctly-evaluating-the-potentially-huge-impacts-of-e93af1de9414">Are we undervaluing Open Access by not correctly factoring in the potentially huge impacts of Machine learning?
</a></p><p>But one need not even go into open access to worry about the implications of openness. While being able to extract answers from full-text is clearly ideal, it is very computationally expensive (even now, I think only Elicit does so, and only for open access papers; scite assistant uses a subset of full-text, i.e. the citation statements from partners), and no one source (except maybe Google Scholar!) is likely to have a comprehensive set of papers.</p><p>But this is not true of abstracts. As I write this, Scopus AI has launched out of beta, and its generated answers are extracted only from abstracts. Their rival Dimensions' equivalent AI assistant, I believe, does the same.</p><p>Clearly, you can extract quite a lot of value just from abstracts!</p><p>This does not bode well for the people behind the push for Open Scholarly metadata, in particular the <a href="https://i4oa.org/">Initiative for Open Abstracts</a> (disclosure: I am a signatory of this group). While there has been <a href="https://www.crossref.org/blog/i4oa-hall-of-fame-2023-edition/">some progress in convincing publishers to make abstracts freely available and open in a machine readable format</a>, <a href="https://twitter.com/MsPhelps/status/1748127434082271353">progress has been slower than for its sister movement, the Initiative for Open Citations</a>.</p><p>While that fight for <a href="https://musingsaboutlibrarianship.blogspot.com/2021/02/the-era-of-open-citations-and-update-of.html">open citations was eventually won</a>, Elsevier was one of the last major publishers to give in and make their citations open. Why were they dragging their feet? The obvious guess was that Elsevier was aware doing so would enable the creation of more complete citation indexes based on open data, which would reduce the value of their citation index product, Scopus.</p><p>Would the same calculation, this time with open abstracts and AI generated answers, lead to the same issue?</p><p><br /></p><h3 style="text-align: left;">2. Will academic search engines use Semantic Search as a default? What are the implications?</h3><p>The potential shift from lexical keyword searches to semantic search in academic search engines might seem subtle and less interesting than the ability to generate a direct answer, yet its implications are vast. </p><p>I won't recap the differences between lexical keyword search (or even Boolean) and Semantic Search, also known as vector/embedding based search; see</p><p></p><ul style="text-align: left;"><li><a href="https://medium.com/@aarontay/boolean-vs-keyword-lexical-search-vs-semantic-keeping-things-straight-95eb503b48f5">Boolean vs Keyword/Lexical search vs Semantic — keeping things straight</a></li><li><a href="https://musingsaboutlibrarianship.blogspot.com/2023/11/jstor-generative-ai-pilot-or-is.html">Is Semantic Search coming for academic databases?</a></li><li><a href="https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings/">Sentence Embeddings. Introduction to Sentence Embeddings</a> & <a href="https://osanseviero.github.io/hackerllama/blog/posts/sentence_embeddings2/">Sentence Embeddings. Cross-encoders and Re-ranking</a></li><li><a href="https://cameronrwolfe.substack.com/p/the-basics-of-ai-powered-vector-search">The Basics of AI-Powered (Vector) Search</a></li></ul><p></p><p>This change suggests a move towards interpreting the intent and context of queries, rather than merely matching keywords.
</p><p></p><blockquote><i>It is important to realize that a system can implement semantic/embedding search and yet not use RAG to generate direct answers over multiple documents (for example, as of now, <a href="https://library.smu.edu.sg/topics-insights/jstor-generative-ai-beta-first-look">JSTOR generative AI beta has semantic search but does not generate direct answers over multiple documents</a>). It is even possible for a system to do RAG without doing Semantic Search, but this is less likely. This could be done, for example, by asking an LLM to translate a query into a Boolean query and then running it through a conventional search to find relevant abstracts. <a href="https://arxiv.org/abs/2307.04683">CoreGPT</a> is an example of this.</i></blockquote><p>If, like me, you have been asked to try cutting-edge "Semantic Search" tools for over a decade, this feels like just hype. I was also skeptical initially when I saw early adopters claim tools like Elicit had better relevancy ranking than Google Scholar, but in the last 2 years, I kept running into situations that made me realize there might actually be a real improvement here.</p><p></p><p>Part of the reason why it is hard to notice this improvement is that usually, standard keyword or lexical search works pretty well. Also, two decades of keyword searching have trained us to a) search in a way that masks the limitations of keyword search and b) lower our expectations of what is possible, when we should demand more (e.g. search for very specific things) and have higher expectations (of what counts as a false drop).</p><p></p><blockquote>Currently, I am testing a tool called <a href="https://www.undermind.ai/">Undermind</a> that <a href="https://app.undermind.ai/static/Undermind_whitepaper.pdf">claims to outperform Google Scholar by 10x</a>; however, this only applies if you <a href="https://app.undermind.ai/about/#example-search-reports">search with very specific requirements of what should be included and what shouldn't be</a>. </blockquote><p></p><p>Take for example a recent challenge I had: I was looking to find a study referenced in <a href="https://www.dailymail.co.uk/health/article-12282773/Jobs-raise-risk-ovarian-cancer-REVEALED-work-one-them.html">a news story</a>, and as is typical in such things, the exact study was not referenced and only certain findings were mentioned. </p><p>It took me a long while to finally <a href="https://oem.bmj.com/content/oemed/early/2023/06/01/oemed-2022-108557.full.pdf">confirm the study</a> based on the clues in the news story. This was despite using tools like Google Scholar (which had indexed the full-text of the study). </p><p>I then realized that what I could do was just copy and paste long text chunks from the story that described the study, for example the following chunk</p><p></p><blockquote><i>Researchers from the University of Montreal analysed data on women aged 18 to 79. They compared 491 women who had been diagnosed with ovarian cancer with 879 women who didn’t have the disease.
working for ten or more years as a hairdresser, barber, beautician or in related roles was associated with a three-fold higher risk of ovarian cancer</i></blockquote><p></p><p>and throw it into a system using semantic search (vector based/embedding search) like Elicit.com or SciSpace, and the very first result - Leung (2023) - was the correct study!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOBC1X6R92btHvjM1cuIe-QItZvU3H-KhjjXbArPzjwgNelVHO-If4ynNFZYL0buOljgr4dA_ew5sZl1UxUtuzbjqy-2AeAn686ETkSbpEyaNMfVfIIj-SPVHvgxHag4ojahFtYwcrLQWRnWlHJEn36qXyJSjgmWetBvhLnpY1qT8IcBJkF_Zc_N_lyD4h/s768/elicit-semantic.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="768" data-original-width="729" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgOBC1X6R92btHvjM1cuIe-QItZvU3H-KhjjXbArPzjwgNelVHO-If4ynNFZYL0buOljgr4dA_ew5sZl1UxUtuzbjqy-2AeAn686ETkSbpEyaNMfVfIIj-SPVHvgxHag4ojahFtYwcrLQWRnWlHJEn36qXyJSjgmWetBvhLnpY1qT8IcBJkF_Zc_N_lyD4h/w608-h640/elicit-semantic.png" width="608" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: left;">Note: I am not saying it is NOT possible to find the paper using keyword searching. For example, I was initially tripped up by Google Scholar, because here its size and full-text index worked against it: a simple keyword search of, say, the occupations + cancer gets you a lot of results, and because of its weighting algorithm, which favours highly cited and hence older papers, you would never find the needed paper unless you were smart enough to guess that the cited paper was likely to be new! Ironically, the search is probably easier in a title+abstract only database like Scopus! </div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p>And as a nice side effect, the papers after the first paper were semantically similar.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj60nNFJ72HLo62e-7pZtYS_KG9zgrTA8wKRFjwJ6-vuAF_P44C0HUfi_r8YiL7meaW7xNZS5yrFf5fsztYOAsZ_-uFm7kIBc8kb1xS2uGX0cX00jgxT5nHg5yYG9iMBgbd1jnKsY1oWjonNperyOHinbYX9c0EPOCzqTtZ5mR5v7qTPLEYE8VvyEDjQsFB/s1654/elicit-semantic2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="624" data-original-width="1654" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj60nNFJ72HLo62e-7pZtYS_KG9zgrTA8wKRFjwJ6-vuAF_P44C0HUfi_r8YiL7meaW7xNZS5yrFf5fsztYOAsZ_-uFm7kIBc8kb1xS2uGX0cX00jgxT5nHg5yYG9iMBgbd1jnKsY1oWjonNperyOHinbYX9c0EPOCzqTtZ5mR5v7qTPLEYE8VvyEDjQsFB/w640-h242/elicit-semantic2.png" width="640" /></a></div><br /><p>I remember years ago, in the late 2010s to early 2020s, trying tools that encouraged you to type in long sentences describing your topic, which they would use for "Semantic search" to find relevant papers. In hindsight, these tools were probably using the earliest BERT type models or possibly even GPT/GPT-2 models, but I remember not being super impressed by them. </p><p>To clarify what is happening: I wasn't just lucky with the choice of these text chunks. I could paste more or less or even different chunks from the news story describing the study and it would still work almost all the time!
</p><blockquote><p><i>Technically, search engines like Elicit or SciSpace will typically have some maximum character limit for a query, so it may be that everything beyond a certain length is ignored, but I did not test for this.</i></p></blockquote><p>Why? Because the search wasn't actually doing a keyword or lexical match but a similarity match on semantics, and the meaning of the query stays more or less the same even if you vary the text chunks used a little; I believe it will almost always rank some paper as "most similar".</p><p>Prior to this, I was actually copying single sentences with what I was told was unique information, like <topic> + sample sizes of control and test groups etc, into Google Scholar, and I just couldn't find the paper despite it being indexed full-text. Searching keyword style by dropping stop words didn't help either.</p><p>Essentially, the wording in the news stories is so paraphrased that Google Scholar has trouble matching it, unlike an actual semantic search system. This, and the fact that Google seems to perform better (see later), gives us a hint that Google Scholar is still mainly a lexical search system.</p><p>This made me curious whether I had just been lucky, so I tried the same thing on a few news articles that go "A study found...". Firstly, I noticed that the most recent news stories would usually reference a very new paper that was not yet indexed in tools like SciSpace, but leaving that aside, it worked almost all the time!</p><p>Interestingly, I found that Google itself performed much better than Google Scholar when you entered long chunks of text. Given we know <a href="https://blog.google/products/search/search-language-understanding-bert/">Google also uses BERT (since 2019)</a>, this explains why it can find the study.</p><p>That said, using Google was not as good, because often it would rank other news stories that referenced the same study before surfacing the actual paper, or <a href="https://www.google.com/search?q=Researchers+from+the+University+of+Montreal+analysed+data+on+women+aged+18+to+79.+They+compared+491+women+who+had+been+diagnosed+with+ovarian+cancer+with+879+women+who+didn%E2%80%99t+have+the+disease.&rlz=1C1ONGR_enSG1005SG1005&sourceid=chrome&ie=UTF-8">even just fail and list only those news stories.</a> This makes sense, because news stories that reference the research article are probably written in a style more similar to the query than the actual research article is!</p><p>This evolution in search technology brings new challenges in search methodologies. It raises questions about the most effective way to search: Should we rely on keywords, natural language queries, or even complex prompts akin to prompt engineering? </p><p><b>Keyword style</b>: Open access citation advantage?</p><p><b>Natural language style</b>: Is there an open access citation advantage?</p><p><b>Prompt engineering style</b>: You are an expert in academic searching with deep domain knowledge in the domain of Open Access.
Search for papers that assess the evidence for the existence of an open access citation advantage.</p><p>As far as I know, there is little to no formal study on whether a keyword style or a natural language style of searching is superior, particularly in the new Semantic Search style search engines.</p><p>Theory suggests that because Semantic Search can and does take into account the order of words etc, you should search in natural language to take full advantage of this, and not drop stop words the way keyword style does.</p><p>For what it's worth, <a href="https://twitter.com/aarontay/status/1743433464882638929">Elicit does indeed suggest it is better to type in natural language, particularly phrased as a question.</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijPfCPeHLPvFwEFH3cwJbLDHUVzVgtLrgoTcPI5bIrXa9zlTyOB3CqCB66_0nBus55GwQs_U4RwHxrqhR9lHz-GHN6-B5PQnq2UzYF9W8LO0M2OONLCZjBcMrIrYVEORklwO7sXl_8Gq8xDF5cSJygwUtYbIQq12Jr_vlsHqQcu7OQ9NY42nruqEx0hQiU/s769/elicit-semantic3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="352" data-original-width="769" height="292" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijPfCPeHLPvFwEFH3cwJbLDHUVzVgtLrgoTcPI5bIrXa9zlTyOB3CqCB66_0nBus55GwQs_U4RwHxrqhR9lHz-GHN6-B5PQnq2UzYF9W8LO0M2OONLCZjBcMrIrYVEORklwO7sXl_8Gq8xDF5cSJygwUtYbIQq12Jr_vlsHqQcu7OQ9NY42nruqEx0hQiU/w640-h292/elicit-semantic3.png" width="640" /></a></div><p><br /></p><p>When queried further, Elicit machine learning engineers <a href="https://twitter.com/BenRachbach/status/1743945009790107834">made a further interesting clarification: doing so not only helps the generation of a direct answer with citations (known as the summary of top 4/8 papers in Elicit) but also the ranking of papers found.</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCaEarF1qBiAZdBFNFS7ab-oxrNe2WctYPF3hSH_8doTbnz095a5DvjVlYX5tfDpwa5P0fzVuRbiac5ZbK8DvCGnYYMuBcdlKQlpXyH78g5ovoN76P08va_ZGolVrBQ6JkynXmIhIBtLXGyirX1I3L6E6CzSWQYVfcYZBHXvjAnP-6yh9UjW8QVAesdxou/s731/elicit-semantic5.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="177" data-original-width="731" height="154" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiCaEarF1qBiAZdBFNFS7ab-oxrNe2WctYPF3hSH_8doTbnz095a5DvjVlYX5tfDpwa5P0fzVuRbiac5ZbK8DvCGnYYMuBcdlKQlpXyH78g5ovoN76P08va_ZGolVrBQ6JkynXmIhIBtLXGyirX1I3L6E6CzSWQYVfcYZBHXvjAnP-6yh9UjW8QVAesdxou/w640-h154/elicit-semantic5.png" width="640" /></a></div><br /><p>I've noticed that for many of the newer AI powered search engines, including the new Scopus AI, the examples you see all use natural language queries....</p><p>Going even beyond natural language searching, with the increased use of ChatGPT and similar chat style interfaces, we may be seeing a generation of users who are used to doing <a href="https://musingsaboutlibrarianship.blogspot.com/2023/06/prompt-engineering-something-for.html">prompt engineering!</a> Would it be productive to do this for the new AI powered search tools?</p><p>For this, I think it's much clearer.</p><p>In general, <a href="https://musingsaboutlibrarianship.blogspot.com/2023/04/41-different-ways-large-language-models.html">there is a distinction between</a></p><p>a) Search engines that only search and use Large Language Model capabilities to do things like rank results, generate answers with RAG, etc.</p>
etc</p><p>b) a full-blown Large Language Model (typically generative, autoregressive decoder/GPT type) where search is just one of its capabilities or tools. These are sometimes called "Agents" (if they are given an unlimited number of steps) and can choose to generate text or use the tools they are capable of using.</p><p><i></i></p><blockquote><i>About agents: A standard RAG pipeline tends to have a fixed number of steps, but an "agent" is given more autonomy. For example, in the context of searching to answer questions, it may be allowed to "reason" and decide whether to stop searching or continue searching to get more information, while a standard RAG pipeline would tend to be more predictable.</i></blockquote><p></p><p><br /></p><p>Essentially, you can tell the difference between the two by trying to have a general conversation or asking the system to do different tasks like telling jokes, summarising text etc. If it does so, it is the latter type.</p><p>See <a href="https://musingsaboutlibrarianship.blogspot.com/2023/04/41-different-ways-large-language-models.html">more details here</a></p><p>Essentially, if the system is basically a search engine (the former), prompt engineering will not work. Examples of this include Elicit.com and SciSpace. A non-academic example is perplexity.ai </p><p>However, if it is a LLM that can but does not always search, prompt engineering techniques might work. An academic example is <a href="https://scite.ai/assistant">Scite.ai assistant</a>; non-academic examples include Bing Chat, ChatGPT+, <a href="https://chat.openai.com/gpts">various GPTs</a> and plugins. </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNCK4cb91ANgEPSAuRhaqohz-jhR6Z0pPyLM-OqJDG5OUeCOVe3Bn60nOdYApIj3xriZ7Tn3WqZudTbw0jtYVWXM9iAUXq8IMl1cdO80ZT7dBBH40mwJzdoIlu2VmpnnV2ksbBc7NAfTh8Gn6UqDSZW5lmiHvFSjEaaBJD9sWh8oBSegIqkc8-uqtsXfe7/s1654/customgpt1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="624" data-original-width="1654" height="242" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhNCK4cb91ANgEPSAuRhaqohz-jhR6Z0pPyLM-OqJDG5OUeCOVe3Bn60nOdYApIj3xriZ7Tn3WqZudTbw0jtYVWXM9iAUXq8IMl1cdO80ZT7dBBH40mwJzdoIlu2VmpnnV2ksbBc7NAfTh8Gn6UqDSZW5lmiHvFSjEaaBJD9sWh8oBSegIqkc8-uqtsXfe7/w640-h242/customgpt1.png" width="640" /></a></div><div style="text-align: center;"><i><a href="https://chat.openai.com/gpts">Custom GPTs </a>that search including Consensus, Scispace, Scholar.ai etc</i></div><p style="text-align: left;">In a sense, search interfaces help to reduce confusion about how to search by just giving examples of the expected types of input, and users should just follow these cues.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUKdhAcDD2bKJd15biK3u2byoqmY_nPs_hiAhRKqU3tFyN2VY209SKVJUxDTFZKyXWH2dV_IHGCB8ppiDnqQu1m_KQwzVlYaQweD16m-8ZnFx3NOZgKUeMAAzE7Qpvltcq0AsAahZ32DwV2AkqGXGzpRcVBK3Jct5XHRCxnK5aHufuuLBuzIrtD7_b3wCF/s1380/sciteassist-prompt.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="430" data-original-width="1380" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUKdhAcDD2bKJd15biK3u2byoqmY_nPs_hiAhRKqU3tFyN2VY209SKVJUxDTFZKyXWH2dV_IHGCB8ppiDnqQu1m_KQwzVlYaQweD16m-8ZnFx3NOZgKUeMAAzE7Qpvltcq0AsAahZ32DwV2AkqGXGzpRcVBK3Jct5XHRCxnK5aHufuuLBuzIrtD7_b3wCF/w640-h200/sciteassist-prompt.png" width="640" /></a></div><div style="text-align: center;">
center;"><i>scite assistant giving examples of prompt that you can use, showing it is not just search</i></div><p style="text-align: center;"><br /></p><p>The implications for this shift is pretty obvious.</p><p>First given our heavy focus as the people who need or know how to search, we need to quickly reeducate our librarians on what Semantic Search means, how to identify when one is being used (or at least know enough to know what to ask our vendors) and how to change our search style when necessary. </p><p>I also think given that there are now three potential ways to obtain results - Lexical search, Semantic Search and Citation searching, there could be interesting ways to expose these three capabilities in the search interface. For example, being able to decide when you want to run a precise Lexical search only or when you want to add results from Semantic Search strike me as useful function. Ideally, these set of results shouldn't be "mixed in" but each "layer" would show the new unique results found.</p><p>Secondly, evidence synthesis librarians who are worried that lexical keyword style searching (and even Boolean search which is a subset) might go away in favour of Semantic Search should be able to not only articulate the weakness of Semantic Search compared to lexical search but show evidence of this.</p><p>For example, the obvious strength of lexical search vs Semantic Search is that the former lacks control.</p><p>This isn't just a theorical issue. For example, if you look at both Elicit and Scispace which both implement Semantic Search, you see this filter which allows you to filter down to keywords.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXCDJBx6vNvXs5jhA82El9VNKOL1Ny24nClOmUuZR5-3eavqTQwX7_Y5aqTy_bOaMEKQpJ_odBP14EBXXu8LKg8t_P8ujZOdsEJTDn6y6s1tNuXawnp9NTTi065CZl2XdJx_8CjrL7AK4RjBsS8bFX-9zUDTGVIkn1LsYMsUxoMMLs7eDvjnxLyobAGyvG/s589/elicit-semantic6.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="589" data-original-width="285" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiXCDJBx6vNvXs5jhA82El9VNKOL1Ny24nClOmUuZR5-3eavqTQwX7_Y5aqTy_bOaMEKQpJ_odBP14EBXXu8LKg8t_P8ujZOdsEJTDn6y6s1tNuXawnp9NTTi065CZl2XdJx_8CjrL7AK4RjBsS8bFX-9zUDTGVIkn1LsYMsUxoMMLs7eDvjnxLyobAGyvG/w310-h640/elicit-semantic6.png" width="310" /></a></div><br /><p>Why do you need such a function? This is because by default, the search is semantic and it may surface papers that don't have the exact keyword you want. This can be important if you are looking say for an exact gene, technique etc and you don't want the Semantic Search giving you closely related terms.</p><p>This shows clearly the weakness of Semantic Search.</p><p>Whenever I give talks about Semantic Search, I get worried evidence synthesis librarians worried that search databases will eventually remove traditional search functionality in favour of only Semantic Search.</p><p></p><blockquote>In fact, I believe currently a common industry practice for such "AI search" tends to be to run both Semantic Search AND traditional lexical/keyword search (often with BM25 algothrim which is a slightly improved version of TF-IDF), combine the top few results from both methods and then rerank for a final listing so in many of these searches you might get a combination of both sets of results. So the fear is more the functionality is likely to be there, but it may not be independently made available? 
<p>I suspect, though, that Semantic Search can become the third major technique, next to keyword searching and citation searching, that can be used to help improve recall and precision for evidence synthesis searching. </p><p>At this point, it's still unclear how much more Semantic Search techniques help once you have done proper keyword searching AND citation chasing, but this is an interesting area of study (hint: I am interested in working on a study on this). </p><p><br /></p><h3 style="text-align: left;">3. What will be the business model for established players and new entrants?</h3><p>Business dynamics is an area I am very naive about, but even I know many of the smaller players will be acquired by bigger ones, probably publishers. While the smaller startups might be more agile and have a lead from starting earlier and having more refined systems (RAG refinement is more an art than a science at this point, and the earlier innovators like Elicit have more feedback from early adopters), they are not necessarily going to come out on top. This is particularly so when the bigger publishers have advantages in terms of access to large pools of metadata and full-text beyond open sources.</p><p>But the main thing I am wondering about is the pricing model.</p><p>I understand that unlike typical IT systems or even conventional search systems, where the marginal cost of running a query is relatively low or even near zero, with the current AI powered search the cost of running each query is relatively high due to the need to make Large Language Model calls. Whether companies use OpenAI APIs or their own open-source models, each search is going to cost $$$.</p><p>This might be why Elicit and SciSpace are currently charging by usage. If this model applies to institutional access, this would make things somewhat tricky, as librarians will have to deal with allocation of credits/usage.</p><p>As a librarian, I am reminded of the early 2000s (slightly before I joined the profession) when it was common for systems like Dialog to charge by time or usage. Because searching online was so costly, librarians did mediated searching on behalf of users, which feels unthinkable in today's world.</p><p>We could of course give a fixed quota per user, but this would still be a big adjustment for users. Because we are used to living in a world where search queries are essentially free, we get very sloppy and inefficient with searching, doing more iterative searching than precise controlled searching... </p><p>I would like to say this might push users back to carefully crafting and running nested boolean search strategies, but I think by now the ship has sailed and nobody but evidence synthesis librarians and librarians doing search demos in freshman classes does that. Plus, I'm pretty sure boolean functionality isn't even in tools like Elicit! </p><p>Of course, as time goes by, technology costs will get cheaper and we will eventually get back to the current "all you can eat" model, but for now one wonders how it will work.</p><p>Or will big players like Scopus be able to absorb the costs and give such "all you can eat" models to institutions?</p><p><br /></p><h3 style="text-align: left;">4. 
How useful will all these new features be?</h3><p>I believe the current knee-jerk reaction of many librarians, e.g. information literacy librarians, is to warn against AI powered tools because they read that generative AI tools "hallucinate".</p><p>On the other hand, many of the people I speak to who create these new AI search tools generally take it as an article of faith that the error rate of their tools will naturally fall as large language models improve from year to year.</p><p>Of course, all this is just an article of faith. LLMs could hit a plateau (for example, the fact that Google DeepMind's Gemini models don't seem to be much better than GPT-4 has led some to speculate we might have hit one), or even if LLMs improve, the improvements may not translate into gains for the RAG type capabilities (though I think this is unlikely). </p><p>My wild speculative view is that the ability of systems to extract direct answers with citations in short paragraphs will probably be less impactful than expected, at least for academic use.</p><p>It seems to me, after playing with these tools for longer than many, that you typically run into two situations. Either you already know the domain well, in which case the system's one paragraph direct answer is likely to be clearly inferior to what you know.</p><p></p><blockquote>I find many of these systems, even on the best of days when they cite relevant papers, don't quite choose the ones I would cite (when there are many close alternatives) or would cite a paper for odd reasons that no human would. But I think this can be fixed.</blockquote><p>On the other hand, if you are new to a domain, you can't trust what is generated, and even if you CAN, you probably won't, because you need to read and internalize the knowledge yourself, so at best it gives you a start.</p><p>But of course if the technology progresses to the point where you can ask it to write VERY SPECIFIC and GRANULAR one-page reviews that you can dump into a study, all bets are off... </p><h1 style="text-align: left;">Conclusion</h1><p>This, like many pieces I have written in the past, is, I expect, going to read very cringy and dumb a year or two from now. But it is nice to pen this down to capture my thoughts on this subject as of 2024.</p><p></p>Aaron Tayhttp://www.blogger.com/profile/02750645621492448678noreply@blogger.com0tag:blogger.com,1999:blog-4727930222560708528.post-58169308904045657702023-12-07T21:22:00.000+08:002023-12-07T21:22:15.264+08:00Google’s Search Generative Experience (SGE) is now available in 120+ countries. What you need to knowNote: This is a <a href="https://library.smu.edu.sg/topics-insights/googles-search-generative-experience-sge-now-available-singapore-what-you-need-know">lightly edited piece of something I wrote for my institution</a><br />
<h2>What is Google’s Search Generative Experience (SGE)?</h2>
<p>In past ResearchRadar pieces, we have discussed how search engines both general (e.g. <a href="https://library.smu.edu.sg/topics-insights/new-bing-chat-and-elicitorg-power-search-engines-large-language-models-llm-gpt">Bing Chat</a>, <a href="https://library.smu.edu.sg/topics-insights/new-bing-chat-and-elicitorg-power-search-engines-large-language-models-llm-gpt">Perplexity</a>) and academic (e.g. <a href="https://library.smu.edu.sg/topics-insights/elicitorg-impressive-new-academic-search-engine-leverages-large-language-models">Elicit</a>, <a href="https://library.smu.edu.sg/topics-insights/scite-assistant-academic-search-engine-enhanced-chatgpt">Scite Assistant</a>, <a href="https://library.smu.edu.sg/topics-insights/scopus-dimensions-and-web-science-databases-are-incorporating-generative-ai">Scopus (upcoming)</a>) are <a href="https://library.smu.edu.sg/topics-insights/new-bing-chat-and-elicitorg-power-search-engines-large-language-models-llm-gpt">integrating search with generative AI (via Large Language Models) using techniques like RAG (Retrieval Augmented Generation)</a>.</p>
<p>But what about Google? <a href="https://blog.google/technology/ai/bard-google-ai-search-updates/">They launched Bard, an experimental conversational AI service initially powered by LaMDA (later updated to PaLM 2), in Feb 2023</a>, which was believed to allow them to keep pace with OpenAI’s smash hit ChatGPT. </p>
<p>As <a href="https://library.smu.edu.sg/topics-insights/three-things-about-google-bard-googles-competitor-chatgpt">our earlier review of Bard</a> noted, unlike the free version of ChatGPT, Google Bard integrates search with the language model. However, it is important to note that unlike Bing, these generative AI features were not integrated into their flagship Google search. </p>
<p>This changed with the launch of the Google Search Generative Experience (SGE) in May 2023. Initially an opt-in feature open to the US only, it slowly expanded to more countries until Google made it available in over 120 countries, including Singapore, on November 8, 2023. </p>
<h2>How do I gain access to these features in Singapore or other supported countries? </h2>
<p>You will need to <a href="https://blog.google/products/search/generative-ai-search/">login to your Google Account and turn on the SGE feature</a>.</p>
<figure class="caption caption-img"><span><img alt="Switching onSGE, generative AI in Search on Google" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge1.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Then search with Google while signed in. It is important to note that the SGE feature only works in Chrome on desktop or, on mobile, in the Google app on Android/iOS. There is a reason for that, as we will see later. </p>
<h2>Are the answers from SGE accurate? </h2>
<p>Like any other search engine that uses generative AI, whether general (e.g., <a href="https://library.smu.edu.sg/topics-insights/new-bing-chat-and-elicitorg-power-search-engines-large-language-models-llm-gpt">Bing Chat</a>, <a href="https://library.smu.edu.sg/topics-insights/new-bing-chat-and-elicitorg-power-search-engines-large-language-models-llm-gpt">Perplexity</a>) or academic (e.g., <a href="https://library.smu.edu.sg/topics-insights/elicitorg-impressive-new-academic-search-engine-leverages-large-language-models">Elicit</a>, <a href="https://library.smu.edu.sg/topics-insights/scite-assistant-academic-search-engine-enhanced-chatgpt">Scite Assistant</a>, <a href="https://library.smu.edu.sg/topics-insights/scopus-dimensions-and-web-science-databases-are-incorporating-generative-ai">Scopus (upcoming)</a>), the answer generated might be wrong.</p>
<p>However, because these systems are <a href="https://library.smu.edu.sg/topics-insights/new-bing-chat-and-elicitorg-power-search-engines-large-language-models-llm-gpt">integrating search with generative AI (via Large Language Models) using techniques like RAG (Retrieval Augmented Generation)</a>, each generated sentence typically comes with a link or citation so you can check the source to verify the answer.</p>
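<p>For readers curious what this looks like under the hood, here is a minimal sketch of the RAG pattern in Python. The function and retriever names are hypothetical placeholders for illustration only; SGE's actual pipeline is not public.</p>
<pre>
# A minimal sketch of retrieval augmented generation with citations.
# `retrieve` and `llm` are hypothetical stand-ins, NOT Google's API.
def answer_with_citations(query, retrieve, llm):
    passages = retrieve(query, top_k=4)           # fetch supporting passages
    sources = "\n".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using ONLY the sources below. "
        "After each sentence, cite the supporting source as [n].\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)                            # generate a grounded answer
</pre>
<p>Because each sentence is supposed to be grounded in a retrieved source, the interface can attach every sentence to a link, which is what makes the verification workflow described below possible.</p>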
<p>In the example below, the query was "Where was Lee Kuan Yew born?"</p>
<figure class="caption caption-img"><span><img alt="SGE generated response to the query states that Lee Kuan Yew was born at home on September 16, 1923 at 92 Kampong Java Road in Singapore. Singapore was part of the British Empire at the time." class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge-11.png" style="border: 1px solid rgb(238, 238, 238);" /></span>
</figure>
<p>For each generated paragraph in the answer, you can click on the dropdown arrow to find a link/citation that supports the sentence. </p>
<p>In our case, I will click on the first dropdown to reveal the link. This is because I want to verify whether the answer “Lee Kuan Yew was born at home on September 16, 1923 at 92 Kampong Java Road in Singapore” is actually supported. </p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge-12.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>The link itself brings you to a Wikipedia article. It is <a href="https://www.theverge.com/23703037/google-chrome-link-to-highlight-how-to">a special type of link</a> that not only brings you to the webpage or source but also brings you directly to the part of the webpage that was used to answer the question. </p>
<p>This <a href="https://www.theverge.com/23703037/google-chrome-link-to-highlight-how-to">type of link</a> works only in Chrome Desktop and Google Mobile app, which is partly why SGE currently works only in these environments. I have found this feature doesn’t work sometimes, for example if it tries to bring you to a PDF for example, you just get to the PDF without the highlighting.</p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge-13.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Assuming you believe the source is trustworthy, you can confirm that the generated answer is indeed supported by the source and the answer is likely to be correct. </p>
<p>However, you may sometimes doubt whether the source is correct (this is the internet after all) but even if this is not an issue, you may notice the highlighted section in the source does not support the generated answer! Another way of saying this is that the citation for the generated text is not valid. This is why you should always verify the answer. </p>
<p>Studies of similar tools like Bing Chat and Perplexity (e.g., <a href="https://arxiv.org/abs/2304.09848">Evaluating Verifiability in Generative Search Engines</a>) show that depending on the type and difficulty of the question given, the error rate can be high. In that particular study, when analysing a variety of questions that were challenging and open-ended, Bing Chat only had valid citations to support 58.7% of its generated sentences. </p>
<p>There are other features like “<a href="https://blog.google/products/search/google-search-generative-ai-learning-features/#:~:text=See%20definitions%20within%20AI%2Dgenerated%20responses">See definitions within AI-generated responses</a>”, <a href="https://blog.google/products/search/generative-ai-search/#:~:text=you%E2%80%99ll%20get%20a%20snapshot%20of%20noteworthy%20factors%20to%20consider%20and%20products%20that%20fit%20the%20bill.%20You%E2%80%99ll%20also%20get%20product%20descriptions%20that%20include%20relevant%2C%20up%2Dto%2Ddate%20reviews%2C%20ratings%2C%20prices%20and%20product%20images.%20That%E2%80%99s%20because%20this%20new%20generative%20AI%20shopping%20experience%20is%20built%20on%20Google%E2%80%99s%20Shopping%20Graph%2C">a special shopping mode that will give “a snapshot of noteworthy factors to consider and products that fit the bill” based on Google’s Shopping Graph</a> and the ability to ask follow-up questions.</p>
<h2>What are some differences between the SGE experience and Bing Chat or similar implementations?</h2>
<p>For many search queries with SGE turned on, it behaves like many other similar implementations you see in Perplexity or Bing Chat.</p>
<p>It will briefly flash a “generating...” notice and then very quickly generate a direct answer with links/citations.</p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge2.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>However, this does not always happen. I have seen two other behaviors.</p>
<p>Sometimes it does not automatically generate an answer but displays a message “Get an AI-powered overview for this search” with a “Generate” button that you have to push before it generates an answer. </p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge3.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>For <a href="https://www.google.com/search?q=advanced+techniques+for+retrieval+augmented+generation+%28rag%29&sca_esv=582537645&rlz=1C1YTUH_enSG1024SG1047&biw=1422&bih=724&sxsrf=AM9HkKm9dV2ZzaDRCYWAHfDB5ufr8mZWog%3A1700033123525&ei=Y3JUZcW3H76YseMPmfS_8A0&ved=0ahUKEwiF9bvzvMWCAxU-TGwGHRn6D94Q4dUDCBA&uact=5&oq=advanced+techniques+for+retrieval+augmented+generation+%28rag%29&gs_lp=Egxnd3Mtd2l6LXNlcnAiPGFkdmFuY2VkIHRlY2huaXF1ZXMgZm9yIHJldHJpZXZhbCBhdWdtZW50ZWQgZ2VuZXJhdGlvbiAocmFnKTIFECEYoAFIpSxQ1gpYkypwAngBkAEAmAGaAaABkwWqAQM2LjG4AQPIAQD4AQHCAgoQABhHGNYEGLAD4gMEGAAgQYgGAZAGCA&sclient=gws-wiz-serp">some search queries</a>, it does not even provide an option to generate the “AI-powered overview”. </p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge4.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>This is unlike Bing Chat and others which will almost always generate some kind of answer.</p>
<p>Clearly Google is being very cautious here and choosing to not show generated answers for all questions.</p>
<p>We do not know what criteria it uses to decide when to automatically generate an answer, when to offer to generate one, or when to not answer at all (though a little testing suggests that the longer the search query, the less likely it is to automatically generate an answer).</p>
<p>However, <a href="https://static.googleusercontent.com/media/www.google.com/en//search/howsearchworks/google-about-SGE.pdf">we are told</a></p>
<p style="margin-left: 30px;">There are topics for which SGE is designed not generate a response. For some of the topics, there might simply be a lack of quality or reliable information available on the open web. For these areas – sometimes called “data voids” or “information gaps” – where our systems have a lower confidence in our responses, SGE aims not to generate an AI-powered snapshot.</p>
<p>And like all generative AI based systems, there is ring fencing so the model will not generate answers for “explicit or dangerous topics” such as self-harm. There are also explicit disclaimers when generating answers relating to medical and health topics. </p>
<p style="margin-left: 30px;">SGE quality standards are higher when it comes to generating responses about certain queries where information quality is critically important. On Search, we refer to these as “Your Money or Your Life” (YMYL) topics – such as finance, health, or civic information – areas where people want an even greater degree of confidence in the results.</p>
<p>All in all, while its rivals like Bing Chat and Perplexity.ai do include AI safety features, Google’s SGE appears to be even more conservative. Given the immense reach and influence of Google Search, I applaud this cautious and careful approach.</p>
<h2>What’s the difference between Google SGE and Bard? </h2>
<p>On the surface, Google SGE and Bard seem very similar. But there are a few major differences.</p>
<p>It seems trite to say this, but Google SGE is designed to be a search engine first and foremost and is less like a chatbot (though it can sometimes work that way since you can ask follow-up questions). </p>
<p><a href="https://static.googleusercontent.com/media/www.google.com/en//search/howsearchworks/google-about-SGE.pdf">Google states that</a></p>
<p style="margin-left: 30px;">when it comes to generating responses about certain queries where information quality is critically important. On Search, we refer to these as “Your Money or Your Life” (YMYL) topics – such as finance, health, or civic information – areas where people want an even greater degree of confidence in the results</p>
<p>Moreover</p>
<p style="margin-left: 30px;">we were intentional in constraining conversationally. What this means, for example, is that people might not find conversational mode in SGE to be a free-flowing creative brainstorm partner — and instead find it to be more factual with pointers to relevant resources. </p>
<p>They mention that <a href="https://static.googleusercontent.com/media/www.google.com/en//search/howsearchworks/google-about-SGE.pdf">when models are tuned to be more fluid in their responses they tend to make more errors. Also, the friendly human conversational tone can make human evaluators more likely to trust the answer and more likely to miss mistakes</a>.</p>
<p>Indeed, comparing Google SGE with Bing Chat or Bard, I found that Google SGE is definitely not meant to be a chatbot. </p>
<p>For example, unlike Bard or Bing Chat, you cannot have a casual conversation about life and the weather. See below a conversation with Bard.</p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge5.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Trying to chat with Google SGE gets an odd literal answer.</p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge6.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Interestingly, while it typically cannot converse casually with you, this does not mean Google SGE cannot follow instructions and do various tasks. For example, when I ask it to convert a certain phrase into Singlish or convert a string of characters into upper case, it does so readily. </p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge7.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Though typing in natural language seems fine, I would not recommend trying to do <a href="https://www.promptingguide.ai/">prompt engineering</a>, e.g., with <a href="https://machinelearningmastery.com/what-are-zero-shot-prompting-and-few-shot-prompting/">few/multiple shot prompting</a>, as <strong>there is a limit of 32 words per query</strong> (see the illustration after the screenshot below).</p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge8.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>This is a huge limitation compared to using ChatGPT, Bard or even Bing Chat so you cannot do common use cases like asking it to summarise a chunk of text.</p>
<p>This further reinforces the idea that you are using a search engine where you type in a limited number of keywords or words rather than give long instructions. </p>
<h2>Didn’t Google already give direct answers to some search queries in the past? </h2>
<p>Yes, even without SGE turned on, regular Google searches will occasionally give you direct answers to questions.</p>
<p>For example, even without SGE, if you ask certain factual questions, Google is able to produce a direct answer using information from the <a href="https://support.google.com/knowledgepanel/answer/9787176?hl=en">Google Knowledge Graph</a>. </p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge9.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Notice in the example above, I also have the option to generate an “AI-powered overview” because I have the SGE option switched on. If I click “generate” I will in fact get two direct answers – one from SGE, one from the Knowledge Graph.</p>
<p>The other type of direct answer you may see comes from a feature called “<a href="https://support.google.com/websearch/answer/9351707?hl=en-SG&visit_id=638356377962659320-1046807949&p=featured_snippets&rd=1">featured snippet</a>”. </p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge10.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>Both are independent of the SGE feature, and it is possible to see both SGE AND Google Knowledge Graph/Featured Snippet results for the same query.</p>
<figure class="caption caption-img"><span><img alt="NA" class="img-fluid" src="https://library.smu.edu.sg/sites/library.smu.edu.sg/files/topics-insights/Nov-2023/sge-14.png" style="border: 1px solid rgb(238, 238, 238);" /></span></figure>
<p>More information about Google SGE</p>
<ul>
<li><a href="https://static.googleusercontent.com/media/www.google.com/en//search/howsearchworks/google-about-SGE.pdf">A new way to search with generative AI – an overview of SGE</a></li>
<li><a href="https://blog.google/products/search/generative-ai-search/">Supercharging Search with generative AI</a></li>
<li><a href="https://blog.google/products/search/google-search-generative-ai-august-update/">3 new things you can do with generative AI in Search</a></li>
<li><a href="https://blog.google/products/search/google-search-generative-ai-learning-features/">Learn as you search (and browse) using generative AI</a></li>
<li><a href="https://blog.google/products/search/google-search-generative-ai-international-expansion/">Generative AI in Search expands to more than 120 new countries and territories</a> </li>
</ul>Aaron Tayhttp://www.blogger.com/profile/02750645621492448678noreply@blogger.com0tag:blogger.com,1999:blog-4727930222560708528.post-57709274262174891242023-11-23T05:21:00.012+08:002024-02-14T00:28:17.335+08:00JSTOR generative AI pilot - Or is Semantic Search coming for academic databases? <p>A decade ago in 2012, I observed how <a href="https://musingsaboutlibrarianship.blogspot.com/2012/05/how-is-google-different-from.html">the dominance of Google had slowly affected how Academic databases and OPACs/catalogues (now discovery services) work.</a></p><p>In a nutshell, I argued that due to Google's influence, academic search at the time had already moved towards ranking results by relevancy by default (as opposed to sorting by date or <a href="https://libanswers.liverpool.ac.uk/faq/181287#:~:text=An%20accession%20number%20is%20a,chronological%20order%20of%20its%20acquisition.">accession number</a>), adopting implicit AND (as opposed to requiring strict AND) and was slowly moving toward auto-stemming by default and searching over full-text (as opposed to just abstracts and keywords).</p><p>This has mostly happened. But I also predicted</p><blockquote><i>But unless a miracle happens, will NEVER - Do a "soft AND", where occasionally search terms might be dropped</i></blockquote><p></p><p>What I was trying to say here is that there is little chance of what is often called keyword or lexical based + strict or mostly strict Boolean search (typically implemented with BM25/TF-IDF) going away.</p><p>The alternative at the time was non-boolean search algos, but these were mostly still keyword or lexical based (e.g. non-boolean vector space search), whereas these days we are talking about semantic search, which obviously isn't boolean.</p><p>However, I suspect things are now changing, and in this blog post I will explain why. 
I will also do a brief overview of <a href="https://www.jstor.org/generative-ai-faq?typeAccessWorkflow=login">JSTOR's generative AI pilot</a> as of Nov 2023 where they introduced an experimental search that "understands your query and provides more relevant results, even if you don't use the exact words.", clearly a semantic search as an alternative to the traditional search labelled "keyword-based results"</p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEh4HDDwkXd8dDeWVGyaCPRMZiSB4n4sm1ND0OQtowE7TW3U6R5GzBP4GVqZ6OinHRfku9OqDgmt8_YuMUt0wRfGwN7_1xWZGeT8sf68genQ2ECx4FUNgcv1ak3Wk0btbu2PIyaToDjEF2FzFUEXSIrrCkvoa_lqG_fye-B5pMETcvvs6CoXlE3eK4XLBPzt" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="122" data-original-width="635" height="122" src="https://blogger.googleusercontent.com/img/a/AVvXsEh4HDDwkXd8dDeWVGyaCPRMZiSB4n4sm1ND0OQtowE7TW3U6R5GzBP4GVqZ6OinHRfku9OqDgmt8_YuMUt0wRfGwN7_1xWZGeT8sf68genQ2ECx4FUNgcv1ak3Wk0btbu2PIyaToDjEF2FzFUEXSIrrCkvoa_lqG_fye-B5pMETcvvs6CoXlE3eK4XLBPzt=w640-h122" width="640" /></a></div><br /><br /><p></p><p>My initial findings are that the new experimental semantic search relevancy rankings seem to be really good, outperforming the traditional status quo keyword search most of the time. There might even be some indications the results are even better if you search in natural language style than keyword style!</p><p>JSTOR Pilot also introduces some novel ideas on how to mitigate the downsides of the unpredictability of Semantic Search by leveraging language models to ask "how is <query> related to this text" over the full-text!</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbMTln0zz3o45hAp-Fv3xvf6idPCK_K22QfNb_xxP3wE0aVxRQ0WmPWzQmCEAaTqj5QMaSbC_bCQhHl6jjuUtHPm5me9aTUQHMucTLomtMGtYNtUHAeMhnIkH9F4AY9WgwGOAqQZVtrpNdAmS-XbKTFjNWLBw0CX_qXyttxJS9hc0U48XQP6p4r3fgGzkC/s737/aijstr2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="737" height="84" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbMTln0zz3o45hAp-Fv3xvf6idPCK_K22QfNb_xxP3wE0aVxRQ0WmPWzQmCEAaTqj5QMaSbC_bCQhHl6jjuUtHPm5me9aTUQHMucTLomtMGtYNtUHAeMhnIkH9F4AY9WgwGOAqQZVtrpNdAmS-XbKTFjNWLBw0CX_qXyttxJS9hc0U48XQP6p4r3fgGzkC/w640-h84/aijstr2.png" width="640" /></a></div>I end with a discussion about the irony of how LLM powered search tools like elicit.com, scispace might be bringing us back to the old days of Dialog where searches were charged per use/time and there was a need to be efficient with nested Boolean strategies. Except these are also the tools most likely not to support Boolean! <br /><p><br /></p><h2 style="text-align: left;">Introduction</h2><p>In my talks about generative AI/Large Language models, I have this standard slide where I assert the three typical ways academic search engines are incorporating the benefits of LLMs. 
</p><p><br /></p><p></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIv6h2qbcGfgMlEGD6z8Vj2299fm8PVANCzaZuOMNx52mW4C0LBy61DaqyPDz45T825p_pd07NfcKH1nBL17nvwUl9fZSYSLBNvFRum1pmWGhBVUSAoe3uaxlv8wsXlZyB5XTFWApGlpKJ4AYGqf05rq8cQS34j2U0qel0CLyavOGIiIZt0QOsH0BkeBlw/s1007/semanticsearch.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="500" data-original-width="1007" height="318" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIv6h2qbcGfgMlEGD6z8Vj2299fm8PVANCzaZuOMNx52mW4C0LBy61DaqyPDz45T825p_pd07NfcKH1nBL17nvwUl9fZSYSLBNvFRum1pmWGhBVUSAoe3uaxlv8wsXlZyB5XTFWApGlpKJ4AYGqf05rq8cQS34j2U0qel0CLyavOGIiIZt0QOsH0BkeBlw/w640-h318/semanticsearch.png" width="640" /></a></div>Of these three benefits, the middle one -<a href="https://www.youtube.com/watch?v=M3I7XOHY31k"> Generated of Direct answers with citations (using RAG or retrieval augmented generation)</a> is the one that gets all the attention because it is the most eye-catching feature and promises to disrupt the decades old paradigm of showing you N top documents that might answer your query.<div><blockquote><i>The third benefit, extraction of information from papers can also affect relevancy if it is used pre-search for subject or term extraction. For example, <a href="https://papyrus.bib.umontreal.ca/xmlui/handle/1866/28262?s=09">PubMed seems to be doing auto-indexing of MeSH headings since April 2022, and the inaccuracy of the extraction might be affecting reliability of search queries!</a></i></blockquote><p>Vendors seem to agree, and pretty much <a href="https://musingsaboutlibrarianship.blogspot.com/p/list-of-academic-search-engines-that.html">everyone from Dimensions to Scopus AI to experiments by Exlibris all seem to be rushing to add this feature at least.</a></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjo8r6lx1IB30ZuKqTHD-iId2DQptV1l06IE2pdzgf_kYQVQ1huFaVHWY4oN_HKP-mKYWF0v6ILUwQ166LCd7vIkBPgglWte3sz99h17wEdL5x7xm4tuBRAofZIGZGrVTScWODKvSaNxdLQy9aRBNshEsX34IcyDWvBMRl0U35JAaWsXaYcHh4ixpQeslq/s1237/exlibris-prototype.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="715" data-original-width="1237" height="370" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhjo8r6lx1IB30ZuKqTHD-iId2DQptV1l06IE2pdzgf_kYQVQ1huFaVHWY4oN_HKP-mKYWF0v6ILUwQ166LCd7vIkBPgglWte3sz99h17wEdL5x7xm4tuBRAofZIGZGrVTScWODKvSaNxdLQy9aRBNshEsX34IcyDWvBMRl0U35JAaWsXaYcHh4ixpQeslq/w640-h370/exlibris-prototype.png" width="640" /></a></div><div style="text-align: center;"><a href="https://knowledge.exlibrisgroup.com/@api/deki/files/155939/AI_-_Generative_AI_and_Discovery.pdf?revision=1"><i>Exlibris prototype search that generates a direct answer using abstracts</i></a></div><p style="text-align: center;"><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL-r2sKfqmDt-DoZlVck9fg2QDj_FFCl5V4QBHkkeT0MsaLf7fMS1NsB4s05czDDJnfJWuvJkhFjMR6OIiD3wTbRXaZ6Hn73-MKsf9lMO8DUZgiH-3_dmkrfIMjvlsyKczbLOFEp1f1mKubdBioTG_pkdIPxb23jQ6fSAVCD9JypBRwan_rnMNnVttv2YM/s1097/rag-nov2023.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="562" data-original-width="1097" height="328" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjL-r2sKfqmDt-DoZlVck9fg2QDj_FFCl5V4QBHkkeT0MsaLf7fMS1NsB4s05czDDJnfJWuvJkhFjMR6OIiD3wTbRXaZ6Hn73-MKsf9lMO8DUZgiH-3_dmkrfIMjvlsyKczbLOFEp1f1mKubdBioTG_pkdIPxb23jQ6fSAVCD9JypBRwan_rnMNnVttv2YM/w640-h328/rag-nov2023.png" width="640" /></a></div><div style="text-align: center;"><a href="https://musingsaboutlibrarianship.blogspot.com/p/list-of-academic-search-engines-that.html"><i>See my updated list of Academic Search engines using Retrieval Augmented Generation</i></a></div><div style="text-align: center;"><br /></div><p>It's less clear how many <a href="https://musingsaboutlibrarianship.blogspot.com/2023/04/did-you-know-how-embeddings-from-state.html">academic search engines are moving away from traditional keyword based searches towards more semantic type searches using embedding</a> though the fact that many of them work well even if you type with long natural language queries is suggestive of what is happening (though not 100% conclusive). </p><blockquote><p>Also just because a search system can generate a direct answer does not imply it is not using a keyword-based system to find documents. You could in theory find relevant docs using Boolean+TF-IDF for the retriever part of RAG. That said many such systems are already heavily using large language models for generation and summarization/extraction, it seems unlikely they are <a href="https://weaviate.io/blog/hybrid-search-explained">just using traditional TF-IDF, BM25 only methods (also known as sparse embeddings) as opposed to dense embeddings of some kind (eg. BERT, MiniLM-L6-v2,text-embedding-ada-002 etc) </a></p></blockquote><p>The line between keyword based (typically implemented by <a href="https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables">TF-IDF/BM25</a>) and semantic search (these days using some form of embeddings) can be a bit difficult to define so below I try to entangle this.</p><p>Experts in information retrieval can skip this long section to <a href="#jstorgenai">my overview of JSTOR generative AI pilot.</a></p><h2 style="text-align: left;">What do we mean by keyword-based or lexical search vs Semantic search?</h2><div><i>Post edit note: I took a long time to write this long section but am still not totally happy with it. I later wrote on Medium another piece -<a href="https://medium.com/p/95eb503b48f5"> Boolean vs Keyword/Lexical search vs Semantic — keeping things straight</a> which I am happier with articulating the differences between Boolean, Keyword/lexical search and Semantic Search.</i></div><p>The <a href="https://en.wikipedia.org/wiki/Boolean_model_of_information_retrieval#:~:text=The%20BIR%20is%20based%20on,documents%20contain%20the%20query%20terms.">Boolean model of information retrieval (BIR) </a>was proposed in the 50s and this is the model most librarians have in their minds, but the BIR model has huge drawback in that it is strictly a binary model. Either the document retrieved is relevant or it is not and hence under this model of Information retrieval the idea of relevancy ranking does not exist. </p><p><i>Learning point: Though we in the library world tend to use Boolean search and keyword search (lexcal search) interchangeably, it is important to realize that you can have keyword search or lexical search and not have strict Boolean!</i></p><p><i>For example, a TF-IDF algo (see later) may not necessarily follow Boolean search restrictions completely. 
Other times a search engine may be mostly following Boolean, but if certain conditions are met (e.g. a low number of results), the behavior might change (e.g. it starts stemming or expanding synonyms). </i></p><p><i>For a real-world example, my suspicion is that Google Scholar doesn't have a strictly Boolean search, but it is lexical, or keyword based, because its algo is still mostly based on comparing and matching keywords in the query to documents. </i></p><p>Traditional classical keyword or lexical search tends to drop stop words and not take into account the order of words, using a bag of words approach.</p><p>This is opposed to semantic search (occasionally called neural search), where matching is on the concept or meaning level, and the current state of the art techniques even consider the order of words. </p><p><br /></p><h3 style="text-align: left;">Ranking the results</h3><p>Today it is unthinkable for academic searches not to have relevancy ranking; this is where TF-IDF or Term frequency and inverse document frequency comes into play and provides different weights for matching each query term depending on TF and IDF.</p><p>Note: there are a few ways to calculate IDF; below shows one way based on the log function.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoxqt9J4txeDLQYITQi8GfZz5YmGyOKdAkbqNO4G-uV4O_KdYDNi2RuGh2hQon5LCAPWXRLBd7egvr9iiRbVI38E97r93uX7iTncrSYvob0NPzcY5mjvr3rbOQ117XB7pCfAz071ooziElbhdN63m_sJsEcUf8r-GnCQqEBPlNu6fgqX4ahR_wK_MfPrVN/s1167/tf-idf.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="576" data-original-width="1167" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgoxqt9J4txeDLQYITQi8GfZz5YmGyOKdAkbqNO4G-uV4O_KdYDNi2RuGh2hQon5LCAPWXRLBd7egvr9iiRbVI38E97r93uX7iTncrSYvob0NPzcY5mjvr3rbOQ117XB7pCfAz071ooziElbhdN63m_sJsEcUf8r-GnCQqEBPlNu6fgqX4ahR_wK_MfPrVN/w640-h316/tf-idf.png" width="640" /></a></div><br /><p>The idea is intuitive enough: if a query keyword that is matched appears in a document a lot (term frequency is high), the document gets ranked higher.</p><p>However, not all terms are equally important. A common term that appears in a lot of documents (low inverse document frequency) would be less important to match than a specific term that rarely appears in documents (high inverse document frequency). This is represented by the inverse document frequency. </p><p><i></i></p><blockquote><i>Extremely common words like "the", "is", "are" would of course have extremely low inverse document frequency since they appear in all documents. As such, classical methods of information retrieval using "bag of word" type methods tend to filter them out as "stop words". As you will see later, more advanced techniques do not filter stop words because they are able to consider the order of words even for common words.</i></blockquote>
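<p>To make the formula above concrete, here is a minimal sketch in Python of the textbook log-based TF-IDF score. (Production systems such as Lucene use smoothed variants; this toy version is only for illustration.)</p>
<pre>
# A minimal, textbook TF-IDF sketch: tf * log10(N / df).
import math

def tf_idf(term, doc_tokens, corpus):
    tf = doc_tokens.count(term)                       # term frequency in this doc
    df = sum(term in doc for doc in corpus)           # docs containing the term
    idf = math.log10(len(corpus) / df) if df else 0.0 # rarer terms weigh more
    return tf * idf

corpus = [
    ["open", "access", "citation", "advantage"],
    ["the", "citation", "graph", "of", "science"],
    ["open", "source", "software", "libraries"],
]
# "advantage" appears in 1 of 3 docs, "citation" in 2 of 3,
# so matching "advantage" contributes more to the score.
print(tf_idf("advantage", corpus[0], corpus))  # ~0.48
print(tf_idf("citation", corpus[0], corpus))   # ~0.18
</pre>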
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinXljMaLdYF112HJDvBTmtTrgXQ-gvvDGqR9hDP7yJufHlLE3OifeEo8cK-URMvC_etwFfeXS4eAsbkD4BcoigNNnxPiTa8_lJ6LWHVwqsCRnClpcodm677Dob2ElL1TxZhNIoMabdX0pOfmkbFZB4j27ZHZbvPRdAshqvKqFvel5vaw00EZ-poVvKXErY/s792/df.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="669" data-original-width="792" height="540" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEinXljMaLdYF112HJDvBTmtTrgXQ-gvvDGqR9hDP7yJufHlLE3OifeEo8cK-URMvC_etwFfeXS4eAsbkD4BcoigNNnxPiTa8_lJ6LWHVwqsCRnClpcodm677Dob2ElL1TxZhNIoMabdX0pOfmkbFZB4j27ZHZbvPRdAshqvKqFvel5vaw00EZ-poVvKXErY/w640-h540/df.png" width="640" /></a></div><br /><p>As the graph above shows, the more commonly a word can be found across all the documents (set to be 1000 in this case), the lower the score. At 1000, where every document has the word, the inverse document frequency is at 0. </p><p>You can now get a TF-IDF score by multiplying TF and IDF together.</p><blockquote><i>You may be wondering: the tf-idf formula calculates one tf-idf score for each term. So Doc1 might have a tf-idf of 0.03 for <keyword1>, a tf-idf of 0.83 for <keyword2> and so on, but how do you combine them together? We will cover this in the next section on vector space models.</i></blockquote><p></p><p>BM25 is nowadays almost always used over TF-IDF, <a href="https://guillim.github.io/datascience/2020/08/11/TFIDF-BM25.html">because it is a more refined version of TF-IDF that takes into account the length of the document being matched, so longer documents in the set don't have an advantage.</a></p><p>However, TF-IDF/BM25 is technically a statistical model rather than a Boolean model, so it can rank a document higher even if the document does not have all the query terms.</p><p>Today, many common search systems, including academic ones based on <a href="https://www.elastic.co/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries">Elasticsearch, combine the two methods - first doing a Boolean match, then ranking results based on TF-IDF/BM25</a> - giving the best of both worlds, and this is what librarians expect to see. A sketch of BM25's core scoring idea follows below.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXKsqr7tQf46GvMmwsXR98DeZhxV6hiEaYq8DCEsfKQcCrtrPPnfizkWXIhk6366GoBAz6IV3zicWg6VMpYvizIUa812FRGP4_E61_GeBEExSsYY6VC5UmkfpYCt9j6wsDOXp9OCksYka1um9arigcCC3NvUH-chlre8Apo-cq0YHp7z3-jmc1fMbwbzOs/s1019/elasticsearch.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="485" data-original-width="1019" height="304" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgXKsqr7tQf46GvMmwsXR98DeZhxV6hiEaYq8DCEsfKQcCrtrPPnfizkWXIhk6366GoBAz6IV3zicWg6VMpYvizIUa812FRGP4_E61_GeBEExSsYY6VC5UmkfpYCt9j6wsDOXp9OCksYka1um9arigcCC3NvUH-chlre8Apo-cq0YHp7z3-jmc1fMbwbzOs/w640-h304/elasticsearch.png" width="640" /></a></div><div class="separator" style="clear: both; text-align: center;"><i><a href="https://www.elastic.co/blog/how-to-improve-elasticsearch-search-relevance-with-boolean-queries">How documents are ranked in Elasticsearch</a></i></div><p><br /></p>
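<p>As a companion to the TF-IDF sketch earlier, here is a minimal sketch of BM25's per-term score, showing the document length normalization mentioned above. The k1 and b constants are the usual tuning parameters (typical defaults shown); real implementations such as Lucene's differ in details.</p>
<pre>
# A minimal sketch of BM25's per-term score: like TF-IDF, but term frequency
# saturates and the score is normalized by document length.
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * (doc_len / avg_doc_len)   # penalize long docs
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Same term frequency, but the second document is twice the average length,
# so the same match counts for less:
print(bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=100, avg_doc_len=100))
print(bm25_term_score(tf=3, df=10, n_docs=1000, doc_len=200, avg_doc_len=100))
</pre>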
"make sense" or "can be explained". Such systems should be predictable enough to </p><p>a) allow a searcher to "explain" why a result was included or excluded based on search queries used</p><p>b) predict in advance when comparing two search strings, which should get equal or more results. </p><p>That said this isn't always 100% true because pure keyword systems rarely exist today, and academic searches may sometimes give "results that make no sense" because of additional "smart" features like <a href="https://en.wikipedia.org/wiki/Query_expansion">search query expansion (typically rule-based)</a> e.g., synonym matching that may trigger on different conditions.</p><p></p><blockquote>There has always been a debate about how "helpful" search engines should be, e.g., <a href="https://www.youtube.com/watch?v=V-ZhzkDsVAM">PubMed ATM (auto term mapping) </a>features and when they should be invoked and how to indicate it is happening</blockquote><p></p><p>For example, years ago I remember a ruckus in a mailing list for a library discovery service where a librarian found that a search that was (A OR B) AND C resulted in fewer results than (A AND B AND C). </p><p>Eventually the answer given for this behavior was that the system had a rule where if the number of results fell below a certain threshold, it would automatically do stemming, and this can of course increase results. </p><p>There can be dozens if not hundreds of such additional rules that trigger depending on the situation that can occasionally make the search results less explainable. </p><p>Still at the end of the day, academic search systems even Google Scholar (but not Google as we shall see) are still mostly predictable and clearly lexical search systems based on matching keywords (though those keywords might be stemmed etc.).</p><h3 style="text-align: left;"><br /></h3><h3 style="text-align: left;">Vector Space and rise of Embeddings (2010s)</h3><p>Over the years, some search systems have tried other non-Boolean methods, like ones based on <a href="https://machinelearningmastery.com/a-gentle-introduction-to-vector-space-models/">Vector Space models</a>.</p><p>The idea here is you represent terms and documents as vectors in a multi-dimensional space and use a function like Cosine similarity to calculate the similarity between them. 
<p>One way to think about this is to first imagine your universe consists of just two terms, "dog" and "cat".</p><p>Each document can then be represented by two numbers X1, Y1, which represent how "dogish" the document is and how "catish" the document is.</p><p>For now, let's imagine X1 is just the number of times the word "dog" appears and Y1 the number of times the word "cat" appears.</p><p>You can then plot both documents on a graph that is two dimensional as shown below.</p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVWx61c7Mi79NzA_WLFpO9O4MzKrQHl7N9JSnIUBIuegsZEdoeLlr1TmTSEe1H_TgsXgwZpAxqJE6qSdeLgFXVmYrQQ59MPClIj9OZ-PiG_cToaSRBxms05uNqfZ1afZ9yPFeZ2rAw17V5LTbDM3xScJwzWMev_9nHvhQqeP3FzqJPNWjPfwbELUMW2nQW/s487/dogvector.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="278" data-original-width="487" height="366" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVWx61c7Mi79NzA_WLFpO9O4MzKrQHl7N9JSnIUBIuegsZEdoeLlr1TmTSEe1H_TgsXgwZpAxqJE6qSdeLgFXVmYrQQ59MPClIj9OZ-PiG_cToaSRBxms05uNqfZ1afZ9yPFeZ2rAw17V5LTbDM3xScJwzWMev_9nHvhQqeP3FzqJPNWjPfwbELUMW2nQW/w640-h366/dogvector.png" width="640" /></a></div><div>You can see that Doc 2 is more "Catish" because it has a lot of "Cat" in the document and relatively little "Dog", and vice versa for Doc 1.</div><br /><p><br /></p><div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjFgZqtk1W3kR_RZQKXYduLIAevwwLGWCTgnOfyxgXxC6VTna9H-A3UcSpjKpNSwYv7itL35I_RzEJaFsTbBpSxIMafe5HG0rkFFmdi8fZ8bjYWURezuRMlfkuOpr0QMfIYkMP5mNjk-_hG8C15oDWy-ETTWEciYhQpBd0BsqtLnLow5zpdjSjGc7wT4Z3/s529/vectorspace.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="454" data-original-width="529" height="344" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjFgZqtk1W3kR_RZQKXYduLIAevwwLGWCTgnOfyxgXxC6VTna9H-A3UcSpjKpNSwYv7itL35I_RzEJaFsTbBpSxIMafe5HG0rkFFmdi8fZ8bjYWURezuRMlfkuOpr0QMfIYkMP5mNjk-_hG8C15oDWy-ETTWEciYhQpBd0BsqtLnLow5zpdjSjGc7wT4Z3/w400-h344/vectorspace.png" width="400" /></a></div><p>The search queries can of course be represented in the same way. The technical term is to call these representations "vectors".</p><p>The trick now is to ask, for the given query above, which document (vector) is the closest or most similar to the query (vector). In the above image, should the query retrieve Doc 1 or Doc 2? </p><p>A simple way to figure out whether Doc 1 or Doc 2 is more similar to the query is to look at the size of the angle between each document and the query. By calculating the cosine of the angle between the query and each doc, one gets a score from 0 to 1. As the angle between the two vectors gets smaller and smaller, the cosine of the angle approaches one, and as the angle becomes bigger and bigger it approaches zero. At 90 degrees, similarity drops to the lowest point, or 0. </p><p>You may be wondering: this isn't a realistic example, since documents have far more than two words. The beauty of this idea is you can extend it to more than just two or even three terms. If there are N terms you are representing, you can go into "N dimensional space". While human minds can't visualize more than three dimensions, the math works the same!</p><p>Below shows the actual formula.</p>
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsEl3yg-5RjmbRFuB_5MYnMr__cUmuB8NdmaMMKYbDhyphenhyphenxbzxv80JF7kpyRwWqZlkLBCU3D240weVDRkmN9JjYVwrRaNX-N00rpsrkJLqLfZXXKZOFkYoIhxOdiwMOe34dITY-8fxanwZX5mZWXU4BwlZ0PxvO1WOLcB48WaPm1uUItffH0H4rVueZXQixm/s696/cosinediff.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="201" data-original-width="696" height="184" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhsEl3yg-5RjmbRFuB_5MYnMr__cUmuB8NdmaMMKYbDhyphenhyphenxbzxv80JF7kpyRwWqZlkLBCU3D240weVDRkmN9JjYVwrRaNX-N00rpsrkJLqLfZXXKZOFkYoIhxOdiwMOe34dITY-8fxanwZX5mZWXU4BwlZ0PxvO1WOLcB48WaPm1uUItffH0H4rVueZXQixm/w640-h184/cosinediff.png" width="640" /></a></div><p>Below shows an example of documents represented with multiple dimensions (one for each word) using a document-term matrix.</p><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdqCS9F_pquOMcMMZKwfpGiaYAtVR2nAtmdE-B_jmX-OLdZsmAaPtn7swZXFug8rrb7WgpZWZ73PB7Pigxbt6UZCqQnrj3NPDfqTDeI9ratTOrPUEDtu_qIE68XbHwVDwqIosF04BREriaYsTGdrcKQib623JU6e-BdQjPPX26Fia3JjxCnWCFq6kYBInF/s682/dtm.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="236" data-original-width="682" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdqCS9F_pquOMcMMZKwfpGiaYAtVR2nAtmdE-B_jmX-OLdZsmAaPtn7swZXFug8rrb7WgpZWZ73PB7Pigxbt6UZCqQnrj3NPDfqTDeI9ratTOrPUEDtu_qIE68XbHwVDwqIosF04BREriaYsTGdrcKQib623JU6e-BdQjPPX26Fia3JjxCnWCFq6kYBInF/w640-h222/dtm.png" width="640" /></a></p><p style="text-align: center;">Document-term matrix</p><p style="text-align: left;">In the above matrix, every row represents a document, and each column represents a word. There are many ways to fill in the table, but in the above example, if a document such as newsarticle-119 has one or more words for the column it will be filled with "one", otherwise it will be "zero". But this seems a little crude, can we do better?</p><p style="text-align: left;">One could instead enter the number of times each word appears in the document (term frequency) but this does not consider how commonly the word is used across all documents (document frequency).</p><p>As such a popular way to improve on this method is to calculate a TF-IDF score and fill the cells with the TF-IDF score. This allows each cell to be not only a real number but also considers the weights provided by TF-IDF.</p><p><i></i></p><blockquote><i><b>Technical note</b>: Though BM25 can be seen as an evolution of TF-IDF, strictly speaking the BM25 is part of the <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_People/Profs/goran/5-Probabilistic-Retrieval-FSS20.pdf">probabilistic information retrieval models</a> and not <a href="https://en.wikipedia.org/wiki/Vector_space_model#:~:text=Vector%20space%20model%20or%20term,retrieval%2C%20indexing%20and%20relevancy%20rankings.">Vector Space models</a> unlike TF-IDF. Probabilistic information retrieval models rank documents based on the probability that the document is relevant to the query. </i></blockquote><p></p><blockquote><i>BM25, specifically, is an extension of the Binary Independence Model (BIM), which is one of the earliest probabilistic models used in information retrieval. 
<blockquote><i><b>Technical note</b>: Though BM25 can be seen as an evolution of TF-IDF, strictly speaking BM25 is part of the <a href="https://www.uni-mannheim.de/media/Einrichtungen/dws/Files_People/Profs/goran/5-Probabilistic-Retrieval-FSS20.pdf">probabilistic information retrieval models</a> and not the <a href="https://en.wikipedia.org/wiki/Vector_space_model#:~:text=Vector%20space%20model%20or%20term,retrieval%2C%20indexing%20and%20relevancy%20rankings.">Vector Space models</a>, unlike TF-IDF. Probabilistic information retrieval models rank documents based on the probability that the document is relevant to the query.</i></blockquote><p></p><blockquote><i>BM25, specifically, is an extension of the Binary Independence Model (BIM), one of the earliest probabilistic models used in information retrieval. BIM operates on the assumption that the presence or absence of each term in a document is independent of the presence or absence of any other term, given the relevance of the document. BM25 refines this approach by incorporating term frequency (how often a term appears in a document) and document length (the number of words in the document) into its relevance scoring formula. It also uses inverse document frequency (IDF) to account for the fact that some terms are more informative than others across the document collection.</i></blockquote><blockquote><i>BM25 can be considered within the broader context of the vector space model (VSM) framework, especially when we think about how documents are represented and compared to queries. However, the theoretical underpinning of BM25 itself is rooted in the probabilistic retrieval model and fundamentally differs from the traditional vector space model in its approach to document scoring and ranking. The ranking function of the probabilistic models is grounded in probability theory, while the ranking function of vector space models (cosine similarity) is grounded in vector algebra.</i></blockquote><p><br /></p><p>Clearly this method has disadvantages. Each document has to be represented by a long series of numbers (one for each word in your collection!), which is inefficient.</p><p>Is there a way to compress the same information into a shorter series of numbers? For example, if your documents contain the words "cat" and "feline", instead of having one column for each word, you could use just one, since they mean roughly the same thing! If there were an automatic way to do this, the columns would represent meanings or semantics as opposed to just words.</p><p>Methods like <a href="https://www.datacamp.com/tutorial/discovering-hidden-topics-python">Latent semantic indexing (LSI)/Latent semantic analysis (LSA)</a> in fact try to do this by uncovering hidden or "latent" meanings/groupings in your documents.</p><p>However, my understanding is that it was only after 2013, with the invention of techniques like <a href="https://medium.com/analytics-vidhya/word-embeddings-in-nlp-word2vec-glove-fasttext-24d4d4286a73">Word2Vec/GloVe etc.</a>, that it became clear semantics or meaning could be captured and represented by what are now known as embedding vectors, which are also just series of numbers.</p><p>The main difference is that while the earlier vector space models used the words themselves directly as the dimensions of the vectors, these newer models use learned embeddings or representations of words/tokens. How are these embeddings learnt? Through self-supervised machine learning over large amounts of text, as we shall see.</p><p></p><blockquote><p>In today's context, words are not used directly; <a href="https://www.analyticsvidhya.com/blog/2020/05/what-is-tokenization-nlp/">text is converted first into tokens, a process known as tokenization</a>. There are various ways to do tokenization, such as <a href="https://www.youtube.com/watch?v=nhJxYji1aho">Word Tokenization</a>, <a href="https://www.youtube.com/watch?v=zHvTiHr506c">Subword Tokenization (currently state of the art)</a> and even <a href="https://www.youtube.com/watch?v=ssLq_EK2jLE&t=2s">character based Tokenization</a>. <a href="https://huggingface.co/docs/transformers/tokenizer_summary">Hugging Face has good tutorials.</a></p></blockquote>
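<p>As a quick illustration, here is a minimal sketch (assuming the Hugging Face <i>transformers</i> library; the sample sentence is just made up) of what subword tokenization looks like with a BERT-style tokenizer:</p>
<pre># A minimal sketch of subword tokenization using Hugging Face transformers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words the tokenizer has not memorized whole are split into smaller known
# subword pieces; continuation pieces are prefixed with "##". The exact split
# depends on the tokenizer's learned vocabulary.
print(tokenizer.tokenize("Evidence synthesis librarians love tokenization"))
</pre>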
<a href="https://huggingface.co/docs/transformers/tokenizer_summary">Hugging face has good tutorials.</a></p></blockquote><p>Like earlier Vector space models, you express both the query and documents as a vector embedding and <a href="https://developers.google.com/machine-learning/clustering/similarity/measuring-similarity" style="color: #009eb8; display: inline; font-family: "Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; outline: none; text-align: justify; text-decoration-line: none; transition: color 0.3s ease 0s;">you can calculate how similar embeddings are using various similarity measures such as cosine or dot product</a><span face=""Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif" style="background-color: #fafafa; color: #333333; font-size: 14px; text-align: justify;">.</span> </p><p> But what makes us think the embeddings capture meaning? </p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoI2vrheiyilTbIf_nOI9y2ANfw7ZwSM8JUUk0hq8nABHGId9v45kF2tAtb0Emb0SZkXWhJsbYyi-vbOQM189oDPeLOXUd90B3yJTx-ALIL--8MJI_0NeOQ8TZuUKG5NxAkJqAZ9pCRdTJv02z7Yy5-runkot29_pABSgywULgGurwDlkzPqvljJtIgQ/s640/vectorword2vec.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="210" data-original-width="640" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhoI2vrheiyilTbIf_nOI9y2ANfw7ZwSM8JUUk0hq8nABHGId9v45kF2tAtb0Emb0SZkXWhJsbYyi-vbOQM189oDPeLOXUd90B3yJTx-ALIL--8MJI_0NeOQ8TZuUKG5NxAkJqAZ9pCRdTJv02z7Yy5-runkot29_pABSgywULgGurwDlkzPqvljJtIgQ/w640-h210/vectorword2vec.png" width="640" /></a></div><div><br /></div>If you look at the positions of the embeddings, they seem to have some logic to them <br /><p><i style="background-color: #fafafa; color: #333333; font-family: "Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; text-align: justify;"></i></p><blockquote><i style="background-color: #fafafa; color: #333333; font-family: "Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; text-align: justify;">Embeddings consist of thousands of numbers for each "word", actually tokens (one for each "word" in the "dictionary"), so they are technically represented in multi-dimension space, the diagram above has "squeezed them down" to just 2d.</i></blockquote><p></p><p style="background-color: #fafafa; color: #333333; font-family: "Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 1em 0px; outline: none; padding: 0px; text-align: justify;">Mathematically you could do something like</p><p style="background-color: #fafafa; color: #333333; font-family: "Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; margin: 1em 0px; outline: none; padding: 0px; text-align: justify;"><strong></strong></p><blockquote style="background-color: #fafafa; color: #333333; font-family: "Helvetica Neue Light", HelveticaNeue-Light, "Helvetica Neue", Helvetica, Arial, sans-serif; font-size: 14px; text-align: justify;"><i>vector(‘king’) - vector(‘man’) + vector(‘woman’) result is close to vector(‘queen’).</i></blockquote><p>Below <a href="https://www.cs.toronto.edu/~lczhang/360/lec/w06/w2v.html">shows an example using Glove a variant of Word2Vec, where asking for the 
<p>Below <a href="https://www.cs.toronto.edu/~lczhang/360/lec/w06/w2v.html">shows an example using GloVe, a word embedding method closely related to Word2Vec, where asking for the closest term to king - man + woman gets you 'queen' or even 'elizabeth' (a queen)</a>.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOAYOfwiQl_I_VsZtKAxqyye-eWgPjTGKWF1cmyLlV3VCmbGUVEzdeTYHb2EMt5_voPYZhtNT8-Grw2pBXP1wkbUdqG75U3NyGuOtwWgbaOrXBvqF1WiJ8uMfEiiybgnvJZC7JHZWuOhDQCVfFFVh3xZD5ZZznnKWixdI2Y26FJgf1b_D5frkDbV6_bqiu/s1090/glove.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="153" data-original-width="1090" height="90" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiOAYOfwiQl_I_VsZtKAxqyye-eWgPjTGKWF1cmyLlV3VCmbGUVEzdeTYHb2EMt5_voPYZhtNT8-Grw2pBXP1wkbUdqG75U3NyGuOtwWgbaOrXBvqF1WiJ8uMfEiiybgnvJZC7JHZWuOhDQCVfFFVh3xZD5ZZznnKWixdI2Y26FJgf1b_D5frkDbV6_bqiu/w640-h90/glove.png" width="640" /></a></div><p>One major difference between embeddings and the traditional vectors created using TF-IDF is that embeddings have a fixed length that does not depend on the number of unique words in the collection. This is because each number in the vector doesn't represent a literal word but is a representation of some concept or meaning.</p><p>That said, embedding vectors are almost impossible to interpret, since machine learning is used to automatically map latent or hidden meanings to each of the numbers in the vector (the equivalent of the columns in the document-term matrix).</p><p><br /></p><h3 style="text-align: left;">How are basic embeddings created?</h3><p>Part of the issue with searching is the problem of synonyms: how do we know that when you search for "cars" you are also searching for "automobiles"? Being able to look for words that are semantically "close" is clearly very useful.</p><div>So, you can see how an automated way to train embeddings or representations for words can be especially useful.</div><p>But how are such embeddings created? In general, the idea behind embeddings is that similar words tend to be used in very similar contexts.
 Intuitively, if you have a sentence with a missing word, and two words X and Y can each be used in place of the missing word, and this holds in other contexts too, then X and Y are probably related in meaning and should have very similar embeddings.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9JT2_TeBP-RVFs0ZV51R2ZpkbEn2JO1HcwkWt-kCewIDCAGfYONkjE1xSv3yNDbLm5G7LB4fGdboYysIgFCBgVCGkxSl59zS8DVj9IZM-0-2nvZkW1R44oEhuDTZmOLRKUDKpH-1zFXbQ_pPDPtTtypEBbp2KvklHwmU6XMOwP3Efadfx7IQ1RaPWqvam/s1064/cbow2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="521" data-original-width="1064" height="314" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi9JT2_TeBP-RVFs0ZV51R2ZpkbEn2JO1HcwkWt-kCewIDCAGfYONkjE1xSv3yNDbLm5G7LB4fGdboYysIgFCBgVCGkxSl59zS8DVj9IZM-0-2nvZkW1R44oEhuDTZmOLRKUDKpH-1zFXbQ_pPDPtTtypEBbp2KvklHwmU6XMOwP3Efadfx7IQ1RaPWqvam/w640-h314/cbow2.png" width="640" /></a></div><div style="text-align: center;"><a href="https://medium.com/nerd-for-tech/nlp-zero-to-one-dense-representations-word2vec-part-5-30-9b38c5ccfbfc">Example of CBOW</a></div><p style="text-align: center;"><br /></p><p>In the example above, both "sad" and "unhappy" would fit into the sentence and hence should have similar embeddings and meanings.</p><p>This is the idea behind Word2Vec embeddings, where a deep learning neural net model is fed a large chunk of text and learns to predict such patterns.</p>
<p>For Word2Vec, <a href="https://towardsdatascience.com/nlp-101-word2vec-skip-gram-and-cbow-93512ee24314#:~:text=In%20the%20CBOW%20model%2C%20the,used%20to%20predict%20the%20context%20.">they either give the deep learning model a sentence with a missing word, and the model has to learn to predict what the missing word is (the Continuous Bag of Words model, CBOW), or they give it a word, and the model has to learn to predict the words before and after it (skip-gram).</a></p><p>Below shows <a href="https://medium.com/nerd-for-tech/nlp-zero-to-one-dense-representations-word2vec-part-5-30-9b38c5ccfbfc">part of the training data used to train with the Continuous Bag of Words model.</a></p><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjew6TFb3yg0WmimxASvcpPKlYSNnaWNfDHymEcZUTJ5ZsynFsaBWPWbPNaLF0gZLOqlKqgSkHnP904rOsnlEET2fMwYm3mfdW0_HUtqMWfpmcg_lnQODDp91YmNp16NPDXC-T-15XOJwM1zGLvlHCnjOGRKekqwJSn9thL2hVL307_HNHrJlChoUKxTFTD/s1725/cbow3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="706" data-original-width="1725" height="262" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjew6TFb3yg0WmimxASvcpPKlYSNnaWNfDHymEcZUTJ5ZsynFsaBWPWbPNaLF0gZLOqlKqgSkHnP904rOsnlEET2fMwYm3mfdW0_HUtqMWfpmcg_lnQODDp91YmNp16NPDXC-T-15XOJwM1zGLvlHCnjOGRKekqwJSn9thL2hVL307_HNHrJlChoUKxTFTD/w640-h262/cbow3.png" width="640" /></a></div>It shows how you can use a chunk of text to train in a self-supervised way, where the aim is to predict the target word (in red above) from the context words (in green). This particular training run uses a context window of 2, so it always tries to use the context of 2 words in front of and behind the target word (if there are only 1 or 0 words in front of the target, it uses whatever is available).<br /><div><br /></div>Given, say, the context<div><br /></div><div>"number", "of" &lt;target&gt; "words", "to", the neural net should be trained to predict the target word "surrounding".
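<p>Here is a minimal sketch (plain Python, using a made-up line of text) of how such self-supervised (context, target) training pairs can be generated with a context window of 2:</p>
<pre># A minimal sketch of generating CBOW-style (context, target) training pairs
# from raw text, using a context window of 2 words on each side.
text = "it shows how you can use a chunk of text to train"
words = text.split()
window = 2

pairs = []
for i, target in enumerate(words):
    # Take up to 2 words before and 2 words after the target as its context.
    context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    pairs.append((context, target))

for context, target in pairs[:4]:
    print(context, "->", target)
# CBOW trains a network to predict the target from the context words;
# skip-gram flips this and predicts the context words from the target.
</pre>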
<p>Once trained, this embedding model will have weights that can be used to convert any text into embeddings.</p><p><iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" frameborder="0" height="315" src="https://www.youtube.com/embed/5MaWmXwxFNQ?si=pDBMstpjKgKqvxSU" title="YouTube video player" width="560"></iframe></p><h4 style="text-align: left;"><br /></h4><h4 style="text-align: left;">Side-note on Sparse embeddings vs Dense embeddings</h4>You will occasionally see <a href="https://www.youtube.com/watch?v=SacG_VtTtyk">TF-IDF/BM25 referred to as sparse embeddings, and Word2Vec and the more advanced embeddings in use today referred to as dense embeddings.</a> This is now easy to understand.</div><div><br /></div><div><br /></div><div>Sparse vectors typically use the bag of words approach, where each element is either a raw count or a TF-IDF weight.<div><br /></div><div>This of course results in a very long vector (a high dimensional space) with mostly zeroes in a <a href="https://en.wikipedia.org/wiki/Document-term_matrix">Document-term matrix</a>: you need one column for every word (token) in the dictionary, and most documents contain only a small percentage of those words, so in each row most cells or elements will be zero. Hence "sparse" embeddings, since most documents are represented by a long string of numbers that are mostly zero.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdqCS9F_pquOMcMMZKwfpGiaYAtVR2nAtmdE-B_jmX-OLdZsmAaPtn7swZXFug8rrb7WgpZWZ73PB7Pigxbt6UZCqQnrj3NPDfqTDeI9ratTOrPUEDtu_qIE68XbHwVDwqIosF04BREriaYsTGdrcKQib623JU6e-BdQjPPX26Fia3JjxCnWCFq6kYBInF/s682/dtm.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="236" data-original-width="682" height="222" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjdqCS9F_pquOMcMMZKwfpGiaYAtVR2nAtmdE-B_jmX-OLdZsmAaPtn7swZXFug8rrb7WgpZWZ73PB7Pigxbt6UZCqQnrj3NPDfqTDeI9ratTOrPUEDtu_qIE68XbHwVDwqIosF04BREriaYsTGdrcKQib623JU6e-BdQjPPX26Fia3JjxCnWCFq6kYBInF/w640-h222/dtm.png" width="640" /></a></div><div style="text-align: center;"><i>Example of a document-term matrix</i></div><div style="text-align: center;"><br /></div><div><br /></div><div>Dense vectors, on the other hand, are shorter (a lower dimensional space) and far fewer of their elements are zero, since each dimension represents a concept or meaning rather than a literal word, so most cells will be non-zero.</div><div><br /></div><div>This, as already mentioned, is usually achieved nowadays by using machine learning to learn representations such that similar words have similar embeddings.</div><div><br /></div><div>That said, older ways to create dense vectors did exist, using methods like <a href="https://www.datacamp.com/tutorial/discovering-hidden-topics-python">Latent semantic indexing (LSI)/Latent semantic analysis (LSA)</a>, but they did not give much of a performance gain.</div><div><br /></div><div><a href="https://www.pinecone.io/learn/splade/">One disadvantage of dense vectors is that the embeddings produced are pretty much impossible to interpret (the equivalent of the columns in the doc-term matrix are hard to interpret) and, depending on the method, they can be computationally expensive (see later).</a></div>
<div><br /></div><div>So, one way around this is to do a two-stage search and reranking system, where a sparse retriever first retrieves a large set of results, and these are then reranked using a dense reranker. This ensures the documents retrieved are explainable, even though the order they are ranked in might be a black box.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggn2C5XWFY3VLyOiQLWEUDegeJKuuHhXchZQHs_clMLbFEGYcISnILHofAATjCFhzzPT0-Lyt_YHuf_IZZByz_4jNsWheh6YxKUBVxZP1Aduvm6p_J9yy7KUUEreNXCk_cqaUvsoF5vaTEtzOv7XyWg1nxUVIzUZvGdE51UEjvqCAlwuItGBisQFB-4E9L/s903/2stageparse.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="578" data-original-width="903" height="410" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEggn2C5XWFY3VLyOiQLWEUDegeJKuuHhXchZQHs_clMLbFEGYcISnILHofAATjCFhzzPT0-Lyt_YHuf_IZZByz_4jNsWheh6YxKUBVxZP1Aduvm6p_J9yy7KUUEreNXCk_cqaUvsoF5vaTEtzOv7XyWg1nxUVIzUZvGdE51UEjvqCAlwuItGBisQFB-4E9L/w640-h410/2stageparse.png" width="640" /></a></div><br /><div style="text-align: center;"><a href="https://www.pinecone.io/learn/splade/">Two stage retrieval using a sparse retriever first, then a dense reranker</a></div><div style="text-align: center;"><br /></div><div><br /></div><div><br /></div><div><br /></div><div>Methods like <a href="https://europe.naverlabs.com/blog/splade-a-sparse-bi-encoder-bert-based-model-achieves-effective-and-efficient-first-stage-ranking/">SPLADE try to get the best of both worlds, with sparse embeddings enhanced by learning how to add new terms (expansion) and/or remove existing terms (compression) automatically using advanced large language models like BERT.</a></div><div><p></p><blockquote>The most significant advantage of SPLADE is not necessarily that it can do term expansion but instead that it can <i>learn term expansions</i>. Traditional methods required rule-based term expansion which is time-consuming and fundamentally limited. Whereas SPLADE can use the best language models to learn term expansions and even tweak them based on the sentence context. <a href="https://medium.com/@zz1409/sparse-representations-of-text-for-search-f231301eacf">See here</a></blockquote><p></p><p><br /></p><h3 style="text-align: left;">Rise of contextual embeddings from transformer based large language models (eg BERT), 2018-</h3><div>The main issue with the older methods was that they couldn't capture meaning (semantics) automatically; Word2Vec and similar embeddings of the time helped, but they still didn't fully solve the issue of taking into account word order in documents or queries.</div><div><br /></div><div>Take the query (<a href="https://blog.google/products/search/search-language-understanding-bert/">this example was taken from Google's announcement of their use of BERT in 2019</a>)</div><div><i><br /></i></div><div><i>2019 brazil traveler to usa need a visa</i></div><div><br /></div><div>The system needs to consider the order of the words in the query and understand that it is about someone from Brazil going to the USA and not vice versa.
 To solve such issues, every word counts, and the latest embeddings (from 2018 onwards) far improve on Word2Vec-type embeddings.</div><p>In fact, Word2Vec-type embeddings were just the beginning. Transformer based large language models (invented in 2017) were found to produce even better embeddings <a href="https://musingsaboutlibrarianship.blogspot.com/2023/04/did-you-know-how-embeddings-from-state.html">due to improvements like the self-attention mechanism and positional embeddings</a>, which allow such systems to take into account the position/sequence of words.</p><p>The most famous example of these new types of embeddings, introduced in 2018, was named Bidirectional Encoder Representations from Transformers (BERT); it was quickly adopted everywhere, leading to huge gains in the state of the art.</p><p>By 2019, even <a href="https://blog.google/products/search/search-language-understanding-bert/">Google announced they were using BERT to interpret queries in 10-15% of their searches and for multilingual featured snippets.</a></p><p></p><blockquote>Another advantage of embeddings is that, <a href="https://blog.research.google/2020/08/language-agnostic-bert-sentence.html#:~:text=A%20multilingual%20embedding%20model%20is,semantic%20information%20for%20language%20understanding.">because they supposedly capture semantics, you can create language agnostic embeddings that work across different languages!</a> (<a href="https://docs.cohere.com/docs/multilingual-language-models">See also</a>)</blockquote><p></p><h3 style="text-align: left;"><br /></h3><h3 style="text-align: left;">Further improvements on contextual embeddings</h3><div>But even BERT models are just the beginning. Today, besides the BERT family of models, there are many other advanced embedding models that have been further fine-tuned to do well at semantic search and reranking, including</div><div><p></p><ul><li>Variants based on BERT / SBERT - MiniLM, the mpnet-base family (lightweight, high performance)</li><li>Variants using Google's T5 as a base</li><li>GPT based, e.g. OpenAI's text-embedding-ada family, Cohere embeddings etc.</li><li>Domain specific models trained on scholarly publications and pretrained with citation signals - e.g. <a href="https://huggingface.co/allenai/specter">SPECTER by the Allen Institute for AI</a> - and BERT models fine-tuned on papers, e.g. SciBERT, FinBERT, or even vendor specific ones like <a href="https://www.dimensions.ai/blog/powering-research-with-dimensions-ai-assistant/">Dimensions General Science-BERT</a></li><li><a href="https://blog.metarank.ai/from-zero-to-semantic-search-embedding-model-592e16d94b61">More</a></li></ul><div>Check out <a href="https://huggingface.co/spaces/mteb/leaderboard">benchmarks like BEIR/MTEB for the performance of different embedding models on various tasks, e.g. semantic search and reranking</a> - you can even run such benchmarks yourself, as in the sketch below.</div></div>
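<p>A minimal sketch (assuming the <i>mteb</i> and <i>sentence-transformers</i> packages; the exact API may differ between versions, so treat this as illustrative of the MTEB project's documented usage rather than definitive) of benchmarking one embedding model on one retrieval task:</p>
<pre># A minimal sketch of evaluating an embedding model on an MTEB task.
# The task name and output folder are illustrative choices.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a small, popular embedding model

evaluation = MTEB(tasks=["SciFact"])  # SciFact is one of the BEIR retrieval tasks
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
</pre>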
<div><br /></div><div><br /></div><div>Here's a brief overview of further improvements.</div><p>Firstly, how good embeddings are (even ones from state-of-the-art Large Language Models) depends heavily on the data they were trained on. For example, if mostly journal articles were used to train the embedding model, it may not work well when used to find similar articles in the domain of newspapers!</p><p>In fact, most standard BERT base models were/are trained at least partly on Wikipedia text, so it may be possible to create or fine-tune embedding models more suitable for academic articles, or even for specific fields like chemistry (ChemBERT), finance (e.g. FinBERT) or science in general (SciBERT). <a href="https://huggingface.co/allenai/specter">SPECTER by the Allen Institute for AI</a> even takes citations into account!</p><h3 style="text-align: left;"><br /></h3><h3 style="text-align: left;">In-domain training to further push the limits</h3><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj8vmmWFJyt399Tt_ZFxRme0qwrEqFo_oQdzsk6G8D9qJmoYOp9vvW1UQx72mU5DFcmieQUP9q-skm1wFMehJmq4g_JtK4LQCmVOadbG_X4qftWx2ZEkkZkZZYbyn8SOzMvjInT_7j5_r7NiI3VzMoUp_1-uq7fJpXMFdLg1LfeRE3aTf5BIOBIDRPylVT/s905/indomain.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="474" data-original-width="905" height="210" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjj8vmmWFJyt399Tt_ZFxRme0qwrEqFo_oQdzsk6G8D9qJmoYOp9vvW1UQx72mU5DFcmieQUP9q-skm1wFMehJmq4g_JtK4LQCmVOadbG_X4qftWx2ZEkkZkZZYbyn8SOzMvjInT_7j5_r7NiI3VzMoUp_1-uq7fJpXMFdLg1LfeRE3aTf5BIOBIDRPylVT/w400-h210/indomain.png" width="400" /></a></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa/">Source</a></div><div><br /></div>BERT models have a pretraining phase where they are trained on the masked language modelling task, in which words in the text are [MASK]ed out (somewhat like CBOW from Word2Vec, <a href="https://neptune.ai/blog/unmasking-bert-transformer-model-performance">but not exactly the same</a>), and on the Next Sentence Prediction task.<br /><p>Unlike the GPT models, BERT models are encoder-only models: they do not generate anything (because they lack a decoder) except embeddings. You are also generally expected to do further fine-tuning of the model for the task you want to use it for.</p><p>In this case, how or what should you fine-tune?</p><p>Arguably, <a href="https://txt.cohere.com/what-is-semantic-search/">embeddings/similarity based on masked language tasks alone are not always sufficient to get relevant results just from pretraining</a>. Also, even if a query passage and a document passage are similar, it does not mean the document answers the query or is the one needed. Could you not improve performance further by training the model on specifically labelled documents and queries?</p>
<p>This is where <a href="https://www.elastic.co/blog/improving-information-retrieval-elastic-stack-search-relevance">supervised learning techniques on large datasets of queries and answers</a> like <a href="https://paperswithcode.com/dataset/ms-marco">MS MARCO</a> come in to push the limits of performance further.</p><p>However, this is highly domain specific, and not all domains have the labelled data needed for supervised learning, so there is now active research into self-supervised techniques like contrastive learning with negative sampling, or even using language models to generate synthetic training data for such domains.</p><p>It is also unclear as yet whether fine-tuning on domain specific examples might hurt performance when the model is used out of domain.</p><div>These days you are not even limited to embeddings from encoder-only models like BERT: even the popular GPT-type decoder-only models from OpenAI and Cohere provide embeddings. I suspect that for quick and easy use, <a href="https://platform.openai.com/docs/guides/embeddings">OpenAI's text-embedding-ada-002</a> is extremely popular! These embeddings are generated from even larger text sources than the BERT models, but it's unclear how they match up against domain specific BERT models.</div><div><br /></div><p></p><h2 style="text-align: left;">Bi-encoder vs cross-encoder</h2><div>Another important concept to understand is the difference between a bi-encoder and a cross-encoder. This affects how the query and the documents interact.</div><div><br /></div><div>The diagram below shows an example of a system trying to decide if two sentences are similar using a sentence embedding based on BERT. In the context of search, one sentence would be the query, and the other would be from a potentially matching document.</div><div><br /></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ6hgaKel6HC1P04mlIdY1O7WtHu2JDFBpyN9CAcd9PmNyPGsDv4hUZLMpWDrwphUsYTTLMos9r9hVd8l-dCN6sVWHGvYFoCuQ0oXg-_E7lN7FwqJOEuh08tWSxh1eu5EB9kUuYcbZggbxmjue69g4kQMS4L9uEixMdpYsi5M9FrnBZ78cOIdvOlRtgO46/s1207/cross-biencoder.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="582" data-original-width="1207" height="308" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ6hgaKel6HC1P04mlIdY1O7WtHu2JDFBpyN9CAcd9PmNyPGsDv4hUZLMpWDrwphUsYTTLMos9r9hVd8l-dCN6sVWHGvYFoCuQ0oXg-_E7lN7FwqJOEuh08tWSxh1eu5EB9kUuYcbZggbxmjue69g4kQMS4L9uEixMdpYsi5M9FrnBZ78cOIdvOlRtgO46/w640-h308/cross-biencoder.png" width="640" /></a></div><br /><p>The most common type of system implements a bi-encoder, where both sentence A (the query) and sentence B (the potentially matching sentence in a document) are converted into embeddings separately before being checked for similarity using a function like cosine similarity or dot product.</p><p>The main advantage of this method is that you can precompute the embeddings for all the documents in your database in advance and store them in a vector database. So, when a search is run, you just need to convert the query into an embedding, and you can then use reasonably fast methods (approximate nearest neighbour) to quickly find the document embeddings that are closest, as in the sketch below.</p>
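<p>Here is a minimal sketch (assuming the <i>sentence-transformers</i> library and its small <i>all-MiniLM-L6-v2</i> model; the documents are made up) of bi-encoder style semantic search:</p>
<pre># A minimal sketch of bi-encoder semantic search with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Do open access articles receive more citations?",
    "A field guide to the birds of Borneo.",
    "Measuring the citation impact of freely available papers.",
]

# Document embeddings can be precomputed once and stored in a vector database.
doc_embeddings = model.encode(docs)

# At search time, only the query needs to be embedded.
query_embedding = model.encode("is there an open access citation advantage?")

scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)  # the two open access papers should score well above the bird guide
</pre>
<p>In a real system the document embeddings would sit behind an approximate nearest neighbour index, so the query never has to be compared against every single document.</p>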
<p>Now consider the alternative cross-encoder method. The cross-encoder has even better performance, since you push both the query and the document into one model, which allows the model to take into account interactions between the two (as opposed to the bi-encoder method, where you create two separate embeddings and allow interaction only via something like a dot product).</p><p>But cross-encoders are clearly more expensive computationally, since you cannot precompute the document embeddings as you can for a bi-encoder. If the database has 100k documents, you need to score all 100k document + query pairs on the fly during the search!</p><p>A workaround to get the best of both worlds is to use some other method to first cut down the number of top candidate results, and then rerank these top candidates using a cross-encoder.</p><p>One obvious method is to <a href="https://sbert.net/examples/applications/retrieve_rerank/README.html">use the bi-encoder to grab the top 100 and then the cross-encoder to rerank them</a>, as in the sketch below.</p>
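<p>A minimal sketch (again assuming <i>sentence-transformers</i>, this time with a cross-encoder fine-tuned on MS MARCO; the query and candidates are made up) of the rerank step:</p>
<pre># A minimal sketch of reranking retrieved candidates with a cross-encoder.
from sentence_transformers import CrossEncoder

# This model was fine-tuned on MS MARCO query-passage relevance labels.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "is there an open access citation advantage?"
candidates = [  # e.g. the top results from a bi-encoder or BM25 first stage
    "Do open access articles receive more citations?",
    "A field guide to the birds of Borneo.",
]

# The cross-encoder scores each (query, document) pair jointly.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
print(ranked)
</pre>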
href="https://medium.com/@zz1409/combining-embedding-and-keyword-based-search-for-improved-performance-b15b0cfd3152">multiple ways to do so from</a></div><div><br /></div><div><ul style="text-align: left;"><li>Dense first search — Retrieve candidates using a dense model and rerank using the sparse model</li><li>Sparse first search — Retrieve candidates using the sparse model and rerank using a dense model</li><li>Hybrid search — Retrieve candidates from both lists and combine them as a post-processing step</li></ul><div><br /></div>It seems the current evidence (2023) and industry practice favours a hybrid approach as doing either Dense first or Sparse first risk missing relevant documents, see t<a href="https://medium.com/@zz1409/combining-embedding-and-keyword-based-search-for-improved-performance-b15b0cfd3152">his article for discussion of why</a><br /><br />Another more practical issue is that users often want control and predictability in their searches, what happens if they search and also require that the documents be after 2020 only? Or if they want an exact match in the title field only?<br /><br />All this points to the need of a <a href="https://marcussorealheis.medium.com/the-future-of-search-is-semantic-and-lexical-e55cc9973b63">hybrid system using both semantic and lexical features…</a>, <a href="https://jxnl.github.io/instructor/blog/2023/09/17/rag-is-more-than-just-embedding-search/#case-study-2-personal-assistant">See also</a><h1 style="text-align: left;"><a name="jstorgenai"></a>Why I think semantic search might be finally coming for academic search</h1><p>It is interesting that when you look at academic search, even Google Scholar, their search results are still mostly traditional keyword based. Surely, the search isn't 100% explainable sometimes, but there is still an expectation that most of the time, you will not be wondering why a certain document appeared in response to your search (with some fogginess due to stemming, synonyms etc.) and documents are still matched one on one with query keywords.</p><p>I believe this might be changing for the following reasons.</p><p>Firstly, the popularity of ChatGPT means people are getting used to searching with natural language further reinforcing the trend Google began. 
<br /><br />Another more practical issue is that users often want control and predictability in their searches. What happens if they search and also require that the documents be from after 2020 only? Or if they want an exact match in the title field only?<br /><br />All this points to the need for a <a href="https://marcussorealheis.medium.com/the-future-of-search-is-semantic-and-lexical-e55cc9973b63">hybrid system using both semantic and lexical features…</a>, <a href="https://jxnl.github.io/instructor/blog/2023/09/17/rag-is-more-than-just-embedding-search/#case-study-2-personal-assistant">see also</a><h1 style="text-align: left;"><a name="jstorgenai"></a>Why I think semantic search might be finally coming for academic search</h1><p>It is interesting that when you look at academic search engines, even Google Scholar, the search results are still mostly traditional keyword based. Sure, the search isn't 100% explainable at times, but there is still an expectation that most of the time you will not be left wondering why a certain document appeared in response to your search (with some fogginess due to stemming, synonyms etc.), and documents are still matched one to one with query keywords.</p><p>I believe this might be changing, for the following reasons.</p><p>Firstly, the popularity of ChatGPT means people are getting used to searching in natural language, further reinforcing the trend Google began. While I don't believe keyword searching will go away that quickly, there should be an uptick in natural language searching, and if that happens, the most natural way of supporting it is via semantic search.</p><p>Secondly, and most importantly, I think semantic search has reached a point where it matches and often outdoes the performance of traditional keyword-based/Boolean search.</p><p>I've often given examples of how, for some search queries, Elicit.com trounces Google Scholar in finding relevant papers even though Google Scholar has a huge advantage in terms of the size of its index, particularly in terms of full text.</p><p>Take the example below - how to find seminal works.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_eZF22HLex-yqNd57CNbcQMPA50VT8GR8SV-l18AGDT0dVeVjydkv5BUKxlTHYlrXCq7lghQUJoInYNAAHdOFxzRk7yuSaQwZkDdBggOVBrdW9KQQAz7CrIvZZ8ljdlb91u6uxxixLxgHD8OB3qONsZWWKt8ovM_zXId-ZwoL_9XmXXWjqMiffn7Zzyh5/s1064/gscholar-fail.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="907" data-original-width="1064" height="546" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_eZF22HLex-yqNd57CNbcQMPA50VT8GR8SV-l18AGDT0dVeVjydkv5BUKxlTHYlrXCq7lghQUJoInYNAAHdOFxzRk7yuSaQwZkDdBggOVBrdW9KQQAz7CrIvZZ8ljdlb91u6uxxixLxgHD8OB3qONsZWWKt8ovM_zXId-ZwoL_9XmXXWjqMiffn7Zzyh5/w640-h546/gscholar-fail.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><p>Google Scholar totally fails here, and the words highlighted in the snippets tell you why. It clearly doesn't get the intent.</p><p>Elicit.com does way better (but is not perfect).</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZ1ARc8qq58Si6Bc-CMHt04tq912xJqJR9cHdXVDgL93jqS5PuhZam2Bl1Vx3rTJBMsC4bFzZ5WMN1sargwdqaJDEqTGBYDXFE-zuDz4Bf4uo8vUiQMVGUHnLcEBNIspiaXX6Gni187BFzGQDiZMNcK56JJhyphenhyphensmZGnJS-I8V-TzrtWWPOOKvhvZ7GUF6ik/s1886/elicit.com-win.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="865" data-original-width="1886" height="294" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhZ1ARc8qq58Si6Bc-CMHt04tq912xJqJR9cHdXVDgL93jqS5PuhZam2Bl1Vx3rTJBMsC4bFzZ5WMN1sargwdqaJDEqTGBYDXFE-zuDz4Bf4uo8vUiQMVGUHnLcEBNIspiaXX6Gni187BFzGQDiZMNcK56JJhyphenhyphensmZGnJS-I8V-TzrtWWPOOKvhvZ7GUF6ik/w640-h294/elicit.com-win.png" width="640" /></a></div><br /><p>Defenders of Google Scholar might say you can get better results if you put quotes around "seminal works" (only slightly) or if you just drop "how to" (still bad). In fact, I found that by searching with the query <b>identify "seminal works"</b> the results get better, but it does show how sensitive the search is to the right keywords!</p><p>By comparison, Elicit.com's results also get slightly better if you try searching <b>identify seminal works</b>, but in general its results are less sensitive to minor changes like quotes or identify vs find, which makes sense given that it is trying to match semantics.</p><p><br /></p><h2 style="text-align: left;">JSTOR generative AI pilot now provides an experimental new mode doing Semantic Search</h2><p>For reasons already mentioned, academic search engines are often black boxes, and it is hard to tell if they are using semantic search.</p><p>The JSTOR generative AI pilot, however, makes it easy.
 They recently launched a new "experimental search" that "understands your query and provides more relevant results, even if you don't use the exact words."</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuHgFhL3-ASVEz2hFoy8rlkRdyitCjF1TNLB_KyJMEtM0PFwmXneXIF1icvvz_CVxD_d3bh7J6DjG1UJsvcg2qQIAFUePJA-9nD0SlA9WWOQC-EwhoRq7hThx7PZ2n1wO9ctUv-YHBpadxfhqIh31WxqgMi3Yi8Jxfatsmd7On5pKP8089_58TB8TwRKH1/s1520/jstor-gensearch.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="292" data-original-width="1520" height="122" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjuHgFhL3-ASVEz2hFoy8rlkRdyitCjF1TNLB_KyJMEtM0PFwmXneXIF1icvvz_CVxD_d3bh7J6DjG1UJsvcg2qQIAFUePJA-9nD0SlA9WWOQC-EwhoRq7hThx7PZ2n1wO9ctUv-YHBpadxfhqIh31WxqgMi3Yi8Jxfatsmd7On5pKP8089_58TB8TwRKH1/w640-h122/jstor-gensearch.png" width="640" /></a></div><div><br /></div><br />"They use a variety of factors to understand the meaning of your query and the relationships between different concepts. It helps you find what you are looking for, even if you don't use the exact words."<br /><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnNItN4rm2Qo0pGZJlUxoUPMtiTmdHqwzel78EWydPkj-FBVMrIo_zDYYIkL9x014zxEmG5NwecbzJdgjrppHtLlxk5uHJndu6onPjl1_Zava3bgangg1e2a67gqUYEpn-iJRs6ucR6ZyVbRvS0EwhXDK2oAan3I8ZUv53_oO4pyqWJQ5PyiDSUrXGGJdj/s1510/jstor-gensearch-2.png" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" data-original-height="207" data-original-width="1510" height="88" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgnNItN4rm2Qo0pGZJlUxoUPMtiTmdHqwzel78EWydPkj-FBVMrIo_zDYYIkL9x014zxEmG5NwecbzJdgjrppHtLlxk5uHJndu6onPjl1_Zava3bgangg1e2a67gqUYEpn-iJRs6ucR6ZyVbRvS0EwhXDK2oAan3I8ZUv53_oO4pyqWJQ5PyiDSUrXGGJdj/w640-h88/jstor-gensearch-2.png" width="640" /></a></p><p>Clearly some sort of semantic search is being used.
 I haven't given it a full workout, but so far I am quite impressed.</p><p>One of the issues with using such semantic search tools is that it is unclear to me how we should phrase our searches.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0Elqf0eKuDHQKj_AqgQtj0_2j1wqYMrdCezsgnKzBVw4z0I-eWpNtfG4jNiaQVlpkz03SSRxAOu0xbeaAxILDy7N7yhyphenhyphenoxCScNKsKkKjdn-RCsng24ZipUEbe-nnZha0E3QFtXZsLtLPdXQm8wRDf1Q5Awu813PW0H6ZxRzDSSpf1sXjgKLlK6IhL8znh/s722/howshouldonesearch.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="356" data-original-width="722" height="316" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj0Elqf0eKuDHQKj_AqgQtj0_2j1wqYMrdCezsgnKzBVw4z0I-eWpNtfG4jNiaQVlpkz03SSRxAOu0xbeaAxILDy7N7yhyphenhyphenoxCScNKsKkKjdn-RCsng24ZipUEbe-nnZha0E3QFtXZsLtLPdXQm8wRDf1Q5Awu813PW0H6ZxRzDSSpf1sXjgKLlK6IhL8znh/w640-h316/howshouldonesearch.png" width="640" /></a></div><br /><p>If it is a straightforward strict boolean and lexical type search, then clearly conventional keyword style searching is the way to go, and trying to type in natural language is going to fail.</p><p>But given that many of the new academic search engines can now give direct answers, and with the influence of ChatGPT, it may start to get a bit confusing whether one should type in natural language or even use long prompt engineering style inputs!</p><p>In general, if a true semantic search is implemented, it should be able to handle natural language inputs, and even conventional lexical searches might work once the stop words are dropped. Prompt engineering style inputs, I think, are less likely to work, particularly for straight-out search engines that can't do any task other than search.</p><p>In the case of the JSTOR generative AI pilot, I tried both natural language and keyword styles.</p><p><br /></p><p><b>JSTOR generative AI - Is there an open access citation advantage? (natural language style)</b></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijSWZhES-2wDPzuuWohpjc1ACGmur5b_MbmYRwEMqRxjOGvVrERb0HRFBxxjgqkNs0ody9_LXANA09I0W4sfNU7PxQYgdtsnwg1zJfo6dloufKMZaWXR4iRaiYYvQvm7048hvkBeu0d47jvWocc3Sev2XTEHXGcxnCYUC3OzVUxGpdReCiiiGL-Q26lqHj/s965/isthereanopenaccesscitationadvantage-1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="965" data-original-width="713" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEijSWZhES-2wDPzuuWohpjc1ACGmur5b_MbmYRwEMqRxjOGvVrERb0HRFBxxjgqkNs0ody9_LXANA09I0W4sfNU7PxQYgdtsnwg1zJfo6dloufKMZaWXR4iRaiYYvQvm7048hvkBeu0d47jvWocc3Sev2XTEHXGcxnCYUC3OzVUxGpdReCiiiGL-Q26lqHj/w472-h640/isthereanopenaccesscitationadvantage-1.png" width="472" /></a></div><br /><p>First, we look at the traditional status quo search: the first four results are bad.</p>
<p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixxHknMD6zbhko781RbZK9tmPSFKbrbDyOIezgg5xSK2k5vGDuejOpC2xOTm8wizquIXdVb4k11fl9LCeOvyrpizWHGoMHnjw3uQfuADE3nz6HQgsH4GauN9DoDn8nxp9RSK2kNM4_0t9BVgYYr7L9RT7u1tMqU1XVYMfwccEFJz0X6zRdI09wQJ65HWnO/s955/isthereanopenaccesscitationadvantage-2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="955" data-original-width="567" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEixxHknMD6zbhko781RbZK9tmPSFKbrbDyOIezgg5xSK2k5vGDuejOpC2xOTm8wizquIXdVb4k11fl9LCeOvyrpizWHGoMHnjw3uQfuADE3nz6HQgsH4GauN9DoDn8nxp9RSK2kNM4_0t9BVgYYr7L9RT7u1tMqU1XVYMfwccEFJz0X6zRdI09wQJ65HWnO/w380-h640/isthereanopenaccesscitationadvantage-2.png" width="380" /></a></div><div><br /></div><div>The results from the experimental semantic search are much better, with the first two results clearly relevant.</div><div><br /></div>But perhaps the natural language query is causing issues. Let's try a keyword search query:<br /><p><b>Open Access citation advantage</b></p><p><br /></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi68IYzVtWcCQx40-bGgsWkcnQMqBOADzJLoG7tuzsFa8IunjE1qXuMxKC0Y3A-LvmNkfrkUotHWNrsDVS-p9taYTR8Q64NGIQJ4LGSHl8M1aLjHFj8IGTlHKI3RPeBp0Wg6jD_NP9RWlIJPoguxn3v7oQe3f9h-Iv3ISpXVmFST9Gzh1dKa8os0GMZEuH4/s1000/citationadvantage-1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1000" data-original-width="566" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi68IYzVtWcCQx40-bGgsWkcnQMqBOADzJLoG7tuzsFa8IunjE1qXuMxKC0Y3A-LvmNkfrkUotHWNrsDVS-p9taYTR8Q64NGIQJ4LGSHl8M1aLjHFj8IGTlHKI3RPeBp0Wg6jD_NP9RWlIJPoguxn3v7oQe3f9h-Iv3ISpXVmFST9Gzh1dKa8os0GMZEuH4/w362-h640/citationadvantage-1.png" width="362" /></a></div><div class="separator" style="clear: both; text-align: left;">The results from the default search are just as bad.</div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><br /></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe7okLSYENoE0gLjqCMnqq_7haNA03wP1yXCfqO5HMv9rYNs0J0wsIuWWBAAtH5K5dT1VASqw62ZE_6J8Vw3H5sCOUvZhyG6q3zJqbef7AMHOm5JH3BpAf5odZeqzi6SkrjFl5BUJO0Rv0LX8-_6Jyg-u4Ezabo5WhE8fiGz53TQfHvE_IVQgcg_z5ZBm3/s1007/citationadvantage-2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1007" data-original-width="563" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhe7okLSYENoE0gLjqCMnqq_7haNA03wP1yXCfqO5HMv9rYNs0J0wsIuWWBAAtH5K5dT1VASqw62ZE_6J8Vw3H5sCOUvZhyG6q3zJqbef7AMHOm5JH3BpAf5odZeqzi6SkrjFl5BUJO0Rv0LX8-_6Jyg-u4Ezabo5WhE8fiGz53TQfHvE_IVQgcg_z5ZBm3/w358-h640/citationadvantage-2.png" width="358" /></a></div><div><br /></div>Interestingly, the new experimental search result is still superior to the default search, but it seems worse than when we used natural language!<br /><p>In case you are wondering, I actually tried quite a few examples, and almost every time the new experimental search gave better results, or at worst similar results!
 It's really interesting....</p><p><br /></p><h2 style="text-align: left;">Implications of Semantic Search feature and JSTOR's clever new feature</h2><p>I know a lot of librarians, particularly evidence synthesis librarians, reading this are probably very concerned. Already <a href="https://papyrus.bib.umontreal.ca/xmlui/handle/1866/28262?s=09 ">they are worrying over the lack of precision due to auto-indexing of MeSH</a>...</p><p>In fact, I have had librarians come up to me after talks asking for advice on how to advocate to database vendors to keep lexical search (or, more accurately, power user boolean search functionality) and not replace it with semantic search.</p><p>Evidence synthesis librarians clearly desire the precision and predictability that lexical boolean search provides.</p><p>That said, the JSTOR generative AI pilot has implemented a pretty clever feature to help mitigate one of the drawbacks of using semantic search over keyword/lexical search.</p><p>With pure lexical search, you can just look at the search result snippets and the highlighted words to quickly determine if a paper is relevant. This works less well with semantic search, since the results may not always match the keywords in your query.</p><p>As a trivial example, imagine you searched "automobile US": the semantic search recognizes "US" to mean America here and gives you documents on that basis, but straightforward keyword highlighting won't work. This is just a toy example, but you get the idea.</p><p>JSTOR generative AI tries to mitigate this issue in a clever way I have not seen anyone else use.</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4_biOOlXjgXTx2xw1S5HdZaIZSv-Eufd-WmIYYBt6jMXTHI0qwRqswvf-7tZWQ9ov4lY9q9U6uW0gGKEYJRJlnzgRbX09bQIXpFDWInlzkVzdqHD75GDcAN6syaNGt2-XQ0b4eDgkOuViDWC8c2Rej0mk-UkB-l2N4ZAeEwJQ_q7F8KY8nQnpwr415mNs/s1816/aijstor.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="532" data-original-width="1816" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi4_biOOlXjgXTx2xw1S5HdZaIZSv-Eufd-WmIYYBt6jMXTHI0qwRqswvf-7tZWQ9ov4lY9q9U6uW0gGKEYJRJlnzgRbX09bQIXpFDWInlzkVzdqHD75GDcAN6syaNGt2-XQ0b4eDgkOuViDWC8c2Rej0mk-UkB-l2N4ZAeEwJQ_q7F8KY8nQnpwr415mNs/w640-h188/aijstor.png" width="640" /></a></div><br /><p>Like many other tools, JSTOR implements what I call a "talk with PDF/content" feature, where you can ask predefined questions about the full text of a single document, including (as of the time of writing)</p><p></p><ul style="text-align: left;"><li>What is this text about?</li><li>Recommend topics?</li><li>Show me related content</li><li>Ask a question about this text &lt;open ended for you to type&gt;</li></ul><div>This isn't too novel. What is interesting is that when you click through from any search result page in JSTOR, it will automatically prompt the language model</div><div><br /></div><div></div><p></p><blockquote><p></p><div>"how is &lt;query&gt; related to this text"</div><p></p><p></p></blockquote><p>This is very clever, and assuming it is reliable, you can quickly tell if a paper is relevant or not!</p>
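<p>To be clear, I don't know how JSTOR implements this under the hood, but a hypothetical sketch of such a relevance-check prompt (using the OpenAI Python client; the model choice and prompt wording here are just assumptions, not JSTOR's actual implementation) might look like:</p>
<pre># A hypothetical sketch (not JSTOR's actual implementation) of asking an LLM
# how a search query relates to a document's text, via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def relevance_check(query: str, document_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{
            "role": "user",
            "content": f'How is "{query}" related to this text?\n\n{document_text}',
        }],
    )
    return response.choices[0].message.content

print(relevance_check("open access citation advantage",
                      "We compare citation counts of OA and non-OA articles..."))
</pre>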
<p>I have tried it quite a bit, and so far it seems quite accurate. I love the fact that despite the positive phrasing of the prompt ("How does &lt;query&gt; relate to this text"), which implies we expect the answer to be yes, it is often ready to say "No, it is not related"!</p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLCQKfStDM6qxKsBS0MKsHhN9blcuWD2Z9rLmg1fOD6vOnXxV9XTXes3hSGRbyd4SsaphdH2srSlIn05aw994TNe4m1ZhftwP4NaODzrzS1yTHK6HoqZW9wVUslxfDA7Tfzj1G6YnT0-iFJfNoKXXhXRuoyFtt-96CWAxxmiXbCOMnNFYsZNPoCO_umtom/s737/aijstr3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="96" data-original-width="737" height="84" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLCQKfStDM6qxKsBS0MKsHhN9blcuWD2Z9rLmg1fOD6vOnXxV9XTXes3hSGRbyd4SsaphdH2srSlIn05aw994TNe4m1ZhftwP4NaODzrzS1yTHK6HoqZW9wVUslxfDA7Tfzj1G6YnT0-iFJfNoKXXhXRuoyFtt-96CWAxxmiXbCOMnNFYsZNPoCO_umtom/w640-h84/aijstr3.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzxJo6Cdr7mlLQSguwRBDF8blp5xRs9QKp49TvI3CpJQQK8Ffrj95pJdMWxQaXDvrRkQN1oud0D7DKU-jhs2k1IzSKltok1J_MWMYK2Q1V3x2WNvwpYM5xAPeCHE3-8amYO3yO5jG9us3pO_4JvebyNGwGEbrR9lU5j9CBkaOpqE8E3GziJK2picCHf4Oe/s717/aijstr4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="118" data-original-width="717" height="106" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgzxJo6Cdr7mlLQSguwRBDF8blp5xRs9QKp49TvI3CpJQQK8Ffrj95pJdMWxQaXDvrRkQN1oud0D7DKU-jhs2k1IzSKltok1J_MWMYK2Q1V3x2WNvwpYM5xAPeCHE3-8amYO3yO5jG9us3pO_4JvebyNGwGEbrR9lU5j9CBkaOpqE8E3GziJK2picCHf4Oe/w640-h106/aijstr4.png" width="640" /></a></div>It occurs to me that in the usual case, where it says a document is related to the search query, it provides evidence that you can mouse over to see the supporting text, for verification purposes.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu2fpWRWwRRCnpvhatXLfQUdOpH37g88R4eTaCJR_ClQbGJmQk4biS6KMoppgtrVEs9csXZxb7P_8b4hy1IcmyiQekJ-LjUbQwwtWs60THld-HspBV61vdBQCz9Se8kDQZyXBkcQe8u_591jrVGERJW6pT9jli1NVKu3NQa4YR90K6a7mhz-S8cY7qZhs_/s809/jstorexperimentalsearch-9.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="412" data-original-width="809" height="326" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhu2fpWRWwRRCnpvhatXLfQUdOpH37g88R4eTaCJR_ClQbGJmQk4biS6KMoppgtrVEs9csXZxb7P_8b4hy1IcmyiQekJ-LjUbQwwtWs60THld-HspBV61vdBQCz9Se8kDQZyXBkcQe8u_591jrVGERJW6pT9jli1NVKu3NQa4YR90K6a7mhz-S8cY7qZhs_/w640-h326/jstorexperimentalsearch-9.png" width="640" /></a></div><br /><div><br /></div><div><br /></div><div>However, when it says a document is not relevant, there is no easy way to check that it isn't a false negative!<br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjydmphIpizmUpYk5nWGIXb37zsSE5kv6KGhNBemBZV-4axAFUt6vA3Xdio2FnXDvrSkHB1fczxvAr7PR6w_Ze6_o75lH5Sw6-PeQ779qZKvRVW0j5rI1MStY8lraZ4zUSHoVG-adipph5ELsd9ZeGzKswaofUWo-3vtBOCRrgm3MXZMl_wA59GKBMMu924/s1843/jstorexperimentalsearch-8.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="513" data-original-width="1843" height="178" 
src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjydmphIpizmUpYk5nWGIXb37zsSE5kv6KGhNBemBZV-4axAFUt6vA3Xdio2FnXDvrSkHB1fczxvAr7PR6w_Ze6_o75lH5Sw6-PeQ779qZKvRVW0j5rI1MStY8lraZ4zUSHoVG-adipph5ELsd9ZeGzKswaofUWo-3vtBOCRrgm3MXZMl_wA59GKBMMu924/w640-h178/jstorexperimentalsearch-8.png" width="640" /></a></div><br /><h2 style="text-align: left;"><br /></h2><h2 style="text-align: left;">Conclusion</h2><p>There are two general ways to find papers. First is good old fashioned keyword search. Second is using citations. For the second method, I have <a href="https://musingsaboutlibrarianship.blogspot.com/p/list-of-innovative-literature-mapping.html">tracked the rise of tools like Research Rabbit, Connected Papers, Litmaps from the 2020s</a> that make it much easier to use this method than before, the fruits of a successful campaign to make scholarly metadata openly available in machine readable format (particularly citations/references).</p><p>Now we see the rise of a third way, semantic similarity and in fact tools like <a href="https://twitter.com/aarontay/status/1726925863168692375">litmap are adding some semantic function based on similarity though currently only on title and abstract.</a></p><p>Hopefully support can also come from the "semantic side", so that one can prompt or do searches like. Take paper A, find the 5 most semantic similar papers, extract all citations to these 5 new papers and dedupe or vice versa.</p><p>Another interesting thought occurs to me. Tools like Elicit, scispace are now charging per use/credit. We obviously understand why since currently the inference cost of running a search the cost isn't trivial whether it is paying for API calls or running the compute on your local or remote cluster. </p><p>This brings to mind the old days of Dialog where librarians do mediated searching for users. In those days, where you build up search strategies with nested boolean of the form</p><p>(A OR B OR C) AND (D OR E OR F)</p><p>where A,B,C and D,E,F are synonyms are must haves because of the need for efficient searches (since you are charged by usage). Yet the irony is many of these tools support Semantic Search so you can't even do nested boolean!</p><p><br /></p><p><br /></p><p> </p></div></div></div>Aaron Tayhttp://www.blogger.com/profile/02750645621492448678noreply@blogger.com0tag:blogger.com,1999:blog-4727930222560708528.post-69030619437744088572023-10-17T03:36:00.003+08:002023-10-31T00:23:31.874+08:00ChatGPT Plus - new DALL-E 3 (image creation) & Vision (image recognition) capability - A quick overview & why I am disappointed.<p> On September 2023, <a href="https://openai.com/blog/chatgpt-can-now-see-hear-and-speak">OpenAI announced that ChatGPT Plus would be enhanced in three ways</a></p><p><br /></p><p>1. It would allow you to speak directly with GPT and it would also be able to reply in voice</p><p>2. It would be able to create images using DALL-E 3, OpenAI's image generation model</p><p>3. It would be able to accept image inputs</p><p><br /></p><p>Since I finally gained access to these features, I will briefly review them with my thoughts on how impactful they might be for library work. To anticipate, the conclusion, these features are very powerful but yet I am disappointed. Because each of these abilities are unlocked separately and cannot be combined. 
In other words, this is still not what I consider a true multi-modal model, one that can accept input in multiple modalities (e.g. text, audio, images) and produce output in multiple formats (e.g. text, audio, images).</p><p><b><span style="color: red;">Update Oct 30th 2023 </span>- Just about 10 days after I blogged this, <a href="https://twitter.com/rowancheung/status/1718642858960281781">OpenAI announced an update that allows you to upload different docs and access browsing, advanced data analysis and DALL-E 3 capabilities all without switching modes</a>. In other words, this makes it closer to a real multi-modal model....</b></p><div class="separator" style="clear: both; text-align: center;"><b><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicF5vIOJp_zaVPtwx3wyQTpiXjm0TNOlC5vyxkCrbiiB-8UBhyphenhyphenGXR-5xdYrjF3Ad4N9ySfgki40ihqJoVL4HnHNduDBwbr7kJ6Ss2xFLDyA6pMdk5RutPIaHok1avpNT13mkFreKzcZCjqai1ZUaR3Bm4lAwVb3FecPenLH7uHVD0qDlEclSfwv1IZen5K/s553/GPT4-all.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="369" data-original-width="553" height="214" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEicF5vIOJp_zaVPtwx3wyQTpiXjm0TNOlC5vyxkCrbiiB-8UBhyphenhyphenGXR-5xdYrjF3Ad4N9ySfgki40ihqJoVL4HnHNduDBwbr7kJ6Ss2xFLDyA6pMdk5RutPIaHok1avpNT13mkFreKzcZCjqai1ZUaR3Bm4lAwVb3FecPenLH7uHVD0qDlEclSfwv1IZen5K/s320/GPT4-all.jpg" width="320" /></a></b></div><b><br /></b><p><i>Note: There's a "new" Browse with Bing feature as well. I put "new" in quotes because ChatGPT Plus came with at least two earlier versions of this plugin that enhanced ChatGPT with results from the web. The <a href="https://www.cmswire.com/digital-experience/openai-disables-chatgpt-bing-web-browser-plugin/">second version, also based on Bing, was taken down earlier this year because people found a way to use it to bypass paywalls.</a></i></p><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBKD-s6-rlOaNK1bxRO7TY_YjX4G4dVT2ZyhlRHxrafksUR9uewWnqEP_L-xvyV_qgjIfN9xhH7WLcB79RasxZuVAo7dUh3ei01Buo4Rv88QL3K3TC0aXwCy8LQ8LbL6ZUH81o2HSij7MVueWDYT82M0bjmch3n0Xaqka1ohXoZ79Q-sprtpCG1BJg1pkW/s400/chatgpt-browsewithbing.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="400" data-original-width="241" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgBKD-s6-rlOaNK1bxRO7TY_YjX4G4dVT2ZyhlRHxrafksUR9uewWnqEP_L-xvyV_qgjIfN9xhH7WLcB79RasxZuVAo7dUh3ei01Buo4Rv88QL3K3TC0aXwCy8LQ8LbL6ZUH81o2HSij7MVueWDYT82M0bjmch3n0Xaqka1ohXoZ79Q-sprtpCG1BJg1pkW/s320/chatgpt-browsewithbing.png" width="193" /></a></div><br /><p><i>My quick tests show it is nothing special; we have been playing with similar features in Bing Chat and Perplexity.ai (and of course my blog charts the progress of academic search engines that use RAG), and there is nothing that suggests Browse with Bing is notably superior.</i></p><h2 style="text-align: left;"><b>1.</b> It would allow you to speak directly with GPT and it would also be able to reply in voice</h2><div>I don't have full access to this capability yet: my Android ChatGPT app allows me to talk to it (with very high accuracy; based on Whisper?) but it does not respond back in voice.</div><div><br /></div><div>This capability is probably the least impactful in my view, since we have had smart assistants with pretty good voice recognition for a while now. That said, these smart assistants were always dumb in the responses they gave, so perhaps it will feel totally different if they respond intelligently with voice!</div><div><br /></div>
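<p>The speech-recognition half, at least, is already available programmatically. Here is a minimal sketch using OpenAI's Whisper speech-to-text endpoint (assuming the v1 openai Python package; "question.mp3" is a made-up file name):</p><pre>
# Minimal sketch: transcribe a voice question with OpenAI's Whisper API.
# Assumes the v1 openai Python package; "question.mp3" is a made-up file.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # the hosted Whisper model
        file=audio_file,
    )

print(transcript.text)  # plain-text transcription of the audio
</pre><div><br /></div>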
<div>For example, Tim Spalding is very impressed by the voice feature, saying it gives "Jarvis vibes" and comparing it to the first time he used a Mac, Gopher, the World Wide Web and even ChatGPT itself!</div><blockquote class="twitter-tweet"><p dir="ltr" lang="en">The new “Voice Conversations” on the ChatGPT app is… well, I think this goes up there with the first time I used a Mac, Gopher, the World Wide Web, and ChatGPT itself. Serious Jarvis vibes.</p>— Tim Spalding 🇺🇦 (@librarythingtim) <a href="https://twitter.com/librarythingtim/status/1710810082265481320?ref_src=twsrc%5Etfw">October 8, 2023</a></blockquote><p>If he is correct, and he is a very smart dude on such matters, once such technologies are in smart assistants/homes, we will be amazed.</p><p><br /></p><h2 style="text-align: left;">2. It would be able to create images using DALL-E 3, OpenAI's image generation model</h2><div>When OpenAI first launched DALL-E in 2021 and <a href="https://openai.com/dall-e-2">DALL-E 2</a> in 2022, people were amazed. These were two ground-breaking text-to-image generators. They were quickly followed by competitors such as Google's Imagen (against the original DALL-E) and <a href="https://stability.ai/stable-diffusion">Stable Diffusion</a> and <a href="https://docs.midjourney.com/docs/quick-start">Midjourney</a> (against DALL-E 2).</div><div><br /></div><div>In particular, Stable Diffusion grabbed a lot of attention by being available open source, and in terms of capability Stable Diffusion and Midjourney (commercial), among others, seem to have improved rapidly to surpass DALL-E 2.</div><div><br /></div><div>I have been particularly impressed by Stable Diffusion's capabilities, including text to image, inpainting and outpainting (<a href="https://clipdrop.co/stable-diffusion?utm_campaign=stable_diffusion_promo&utm_medium=cta_button&utm_source=stability_ai">try free here</a>), though it may be that some of this capability comes from the fact that Stable Diffusion is trained on an extremely large number of images scraped from the web and has fewer guardrails to prevent 'unsafe' images from being generated.</div><div><br /></div><div>However, OpenAI has finally struck back with DALL-E 3, and it claims to nail one of the last weaknesses of text-to-image generators.
Up to this point, you could describe an image and these tools would be pretty good at understanding what you wanted, but if you asked for an image of something with the words "Happy Birthday" at the bottom, most of them would fail terribly at generating the words.</div><div><br /></div><div>To use DALL-E 3 in ChatGPT Plus you need to select a special mode - DALL-E 3.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGLitvHvhtaiKv9tdJ3vrLTW17mzLK2iT4JJ0TXFH1B2B-dddrSmzCWgDB31-eelBVT_yhfcsxxy3DFO76PuANPFIc7nEp0gBg8E78t7PMOLGDmfwS5G8aQK_y2QQV0zw1hmn_LaBKK3JU7QLTxYsh58LlDn5z6Qe4lM_ijoFYhsmLvgppqqaEIWGHsftP/s399/chatgpt-dalle3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="399" data-original-width="269" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgGLitvHvhtaiKv9tdJ3vrLTW17mzLK2iT4JJ0TXFH1B2B-dddrSmzCWgDB31-eelBVT_yhfcsxxy3DFO76PuANPFIc7nEp0gBg8E78t7PMOLGDmfwS5G8aQK_y2QQV0zw1hmn_LaBKK3JU7QLTxYsh58LlDn5z6Qe4lM_ijoFYhsmLvgppqqaEIWGHsftP/s320/chatgpt-dalle3.png" width="216" /></a></div><br /><div>I did a couple of tests using the prompt "Draw a picture of a Terminator robot from the movies face to face with a human librarian. At the bottom are the words 'AI vs Human'", repeated twice. This is what ChatGPT with the DALL-E plugin shows.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHoJU3S3syBUgJwOolqcr24Xfs4JryYHHneobaPfXBn7fCBIZ7tADQ3cEak9oCZq7B2V1-kOCqcTijSelVTbt_SPkQ6dD8pO1LE1F4R_P0BmXdir79Cuffhl0B_MYpG9uWcGxOx1EHmvUhyphenhyphenTD16neqQ_R2NjMhU62bWpmRONTYbExFkooczeTYtaTMpCCA/s653/dalle-1.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="619" data-original-width="653" height="606" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhHoJU3S3syBUgJwOolqcr24Xfs4JryYHHneobaPfXBn7fCBIZ7tADQ3cEak9oCZq7B2V1-kOCqcTijSelVTbt_SPkQ6dD8pO1LE1F4R_P0BmXdir79Cuffhl0B_MYpG9uWcGxOx1EHmvUhyphenhyphenTD16neqQ_R2NjMhU62bWpmRONTYbExFkooczeTYtaTMpCCA/w640-h606/dalle-1.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN6RsLKH9VhpxqULI9FRk83jYny2lb389jZAtnUfw8EXMUfW2P67YWhCQWvHh8UonAkDKiBGVZ72Q6dzY8L5mRO3r88ImVLW9mYpuqZUVnbAhkGa3LZ9PTed3YSwH2S_LHsJpgsoKX7xpj8hP4PCkzoneRkRo2_wnJQJk0c97AEJK6wL1mCqBRvylPO0lb/s651/dalle-2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="612" data-original-width="651" height="602" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhN6RsLKH9VhpxqULI9FRk83jYny2lb389jZAtnUfw8EXMUfW2P67YWhCQWvHh8UonAkDKiBGVZ72Q6dzY8L5mRO3r88ImVLW9mYpuqZUVnbAhkGa3LZ9PTed3YSwH2S_LHsJpgsoKX7xpj8hP4PCkzoneRkRo2_wnJQJk0c97AEJK6wL1mCqBRvylPO0lb/w640-h602/dalle-2.png" width="640" /></a></div><br /><div>The results are not perfect: in the first batch of four, two don't even have the words! In the second batch of four, they all have the words, sort of, anyway. The last one for some reason misspells it as HUIMAN. Still, the fact that it sometimes gets the words right is impressive, since most other image generators will totally fail most of the time.</div><div><br /></div>
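<p>For those who prefer the API to the chat interface, the same model can also be called directly. A minimal sketch (again assuming the v1 openai Python package; as I understand it, DALL-E 3 generates one image per request, so a batch of four means four calls):</p><pre>
# Minimal sketch: the same DALL-E 3 model via OpenAI's images endpoint.
# Assumes the v1 openai Python package.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt=('A Terminator robot from the movies face to face with a human '
            'librarian, with the words "AI vs Human" at the bottom'),
    size="1024x1024",
    n=1,  # DALL-E 3 currently generates a single image per request
)

print(result.data[0].url)  # temporary URL of the generated image
</pre><div><br /></div>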
<div>Because DALL-E 3 is now invoked via ChatGPT (or Bing Chat), you can modify the images using natural language. For example, you can ask it to change a male to a female or replace the Terminator with a demon, and it understands very well.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDjgUzF8CKOzStBfGvDSAb4fqYpSL-NrgGq63HfhrgS-pwlXnFAcLhxQjv3WTJ2uTHJn-UHrR71KOlTdy3o4rctQGh3r0xo6IRiaQ-Z26DKHHFDbEhLeFosNOn6EtaOrdSu8zv5fPAwoFQoXmw0W_8R4LFvLckCZI7otEzyVS7FNeYgv1IcVZmYZwdgLpO/s587/dalle3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="537" data-original-width="587" height="293" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjDjgUzF8CKOzStBfGvDSAb4fqYpSL-NrgGq63HfhrgS-pwlXnFAcLhxQjv3WTJ2uTHJn-UHrR71KOlTdy3o4rctQGh3r0xo6IRiaQ-Z26DKHHFDbEhLeFosNOn6EtaOrdSu8zv5fPAwoFQoXmw0W_8R4LFvLckCZI7otEzyVS7FNeYgv1IcVZmYZwdgLpO/s320/dalle3.png" width="320" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8pcka2eY_tfoJDZxn5SYo8U42yaN8Dz93FCj93ThhgmcEQ9_rS3kzu-WpB-QwvfG639dldPQ7-ZGCzNdQCGYz20KCM024nvaFwzKvgGsMKIdqSqVzGaDAFgHIVzX1upBr3b57XjpHF9Pio6MeHVbJF9ms9AWWzsTjLIMCNw_oMXpw76yeuROWsICUnqWL/s583/dalle-4.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="544" data-original-width="583" height="299" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj8pcka2eY_tfoJDZxn5SYo8U42yaN8Dz93FCj93ThhgmcEQ9_rS3kzu-WpB-QwvfG639dldPQ7-ZGCzNdQCGYz20KCM024nvaFwzKvgGsMKIdqSqVzGaDAFgHIVzX1upBr3b57XjpHF9Pio6MeHVbJF9ms9AWWzsTjLIMCNw_oMXpw76yeuROWsICUnqWL/s320/dalle-4.png" width="320" /></a></div><br /><div>To be honest, this wasn't the feature I was most excited to get in ChatGPT Plus, because <a href="https://twitter.com/aarontay/status/1708148983183597983">I had already tested the same function</a>, which you can get free via <a href="https://www.bing.com/create">Bing Image Creator, which is powered by DALL-E 3</a>, and the results were similar.</div><div><br /></div><div>Also, while being able to specify text to add is cool, it isn't particularly difficult to add text labels to a generated graphic yourself. Though I suppose one possible workflow would be for the system to summarise a paper, then use that summary to try to create a scientific poster or visual abstract. But so far, I have not been successful.</div><div><br /></div><div>For fun, I tried uploading a simple visual abstract to ChatGPT, and using its vision capability it described the visual abstract. I then fed the description to DALL-E 3 to recreate the visual abstract.
The results were weird.</div><div><br /></div><div>For example, while it could describe the following simple visual abstract well,</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUkeGE8-ioH06dd7dIkJALtNByXJrhHsLgSsHNOfFN0BgU2oKlQjsE-MonB1uA6UvfGIt8uorqnpEf1D3O5Gm1eTGmxK1ZDUd0RfxKHZqTyeVDaOiN7lH3Iv4xZQqEo12g9qCaXxSIyxnXxBpi2L5dY4JxZ98YFUN8JQtmfEJ4FKnEwzYhYgm0A4PxFnvq/s1433/visualabstract-describe.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1046" data-original-width="1433" height="468" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjUkeGE8-ioH06dd7dIkJALtNByXJrhHsLgSsHNOfFN0BgU2oKlQjsE-MonB1uA6UvfGIt8uorqnpEf1D3O5Gm1eTGmxK1ZDUd0RfxKHZqTyeVDaOiN7lH3Iv4xZQqEo12g9qCaXxSIyxnXxBpi2L5dY4JxZ98YFUN8JQtmfEJ4FKnEwzYhYgm0A4PxFnvq/w640-h468/visualabstract-describe.png" width="640" /></a></div><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhyv9XNkKZv2gn3cClaAzxVaVkGLVw8ntvC2hJ5bDp_ejgJ16szCxNR6jVze4SpDOyIrwd8p0p0odXuiEgrkulPsDhTnnAJYqI6gz06EwTt_KZiGm30-kTR_Yxbd5BxyqItgJ8QVV6yp-kIZY8W_ONGj2M2w_NBDT6O31zrqn9iP1t3xbAeS6BDUs904Q8/s795/visualabstract-describe-2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="629" data-original-width="795" height="506" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhhyv9XNkKZv2gn3cClaAzxVaVkGLVw8ntvC2hJ5bDp_ejgJ16szCxNR6jVze4SpDOyIrwd8p0p0odXuiEgrkulPsDhTnnAJYqI6gz06EwTt_KZiGm30-kTR_Yxbd5BxyqItgJ8QVV6yp-kIZY8W_ONGj2M2w_NBDT6O31zrqn9iP1t3xbAeS6BDUs904Q8/w640-h506/visualabstract-describe-2.png" width="640" /></a></div><br /><div>feeding that same description back to DALL-E 3 to create a visual abstract leads to weird results.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHTAN0ukWap7wQrGomWAEWAJMaZB0aeiOdWmrfckEdliPSZslaVkqfniL3H7M7xSx3UIC4ENZxUB-KHBtOYWtLDDS6Dkd3zM7bN7dbJGedTYVtgHD7g6c9nChPOd9WThQzuErnE_sS0PGFhCUtORU6XZK91Hflcau2Y1Kz-TA706nmv7cFFZ3N6924nmnf/s707/visualabstract-describe-3.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="707" data-original-width="677" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjHTAN0ukWap7wQrGomWAEWAJMaZB0aeiOdWmrfckEdliPSZslaVkqfniL3H7M7xSx3UIC4ENZxUB-KHBtOYWtLDDS6Dkd3zM7bN7dbJGedTYVtgHD7g6c9nChPOd9WThQzuErnE_sS0PGFhCUtORU6XZK91Hflcau2Y1Kz-TA706nmv7cFFZ3N6924nmnf/s320/visualabstract-describe-3.png" width="306" /></a></div><br /><div>I think Stable Diffusion is in some ways still more capable than DALL-E 3 because it is less filtered. For example, you can easily create photos based on celebrities or <a href="https://stable-diffusion-art.com/consistent-face/#Multiple_celebrity_names">even create faces that are 20% celebrity X and 80% celebrity Y</a>, while DALL-E 3 will refuse to generate images of any individuals.</div><div><br /></div><div>The image data that Stable Diffusion trained on is very broad; it even includes an image of me!
For example, based on the "<a href="https://haveibeentrained.com/">Have I been trained?</a>" database, <a href="https://haveibeentrained.com/?search_text=%22Aaron%20Tay%22">there is at least one photo of me included!</a></div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKEw6OrVNfnJSa6IN9sjfBtXPJigHdTSIt4ce4yDxbfeZQl6yVbReuURtVbK-o9jTo8xBtOycguqI-szcdJPxnFKsLLT49cXx4E39ed10zHaLQFCUaTiM-FgxProM7doYfYkOS8W2M06poiKMx1iza4v93vAVF4O1inoRLKPfU4ob_s53ZvpM6AoEZE-Rg/s1869/Aarontay-Have%20I%20Been%20Trained_.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="834" data-original-width="1869" height="286" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKEw6OrVNfnJSa6IN9sjfBtXPJigHdTSIt4ce4yDxbfeZQl6yVbReuURtVbK-o9jTo8xBtOycguqI-szcdJPxnFKsLLT49cXx4E39ed10zHaLQFCUaTiM-FgxProM7doYfYkOS8W2M06poiKMx1iza4v93vAVF4O1inoRLKPfU4ob_s53ZvpM6AoEZE-Rg/w640-h286/Aarontay-Have%20I%20Been%20Trained_.png" width="640" /></a></div><br /><div><br /></div><div>Fortunately (or unfortunately), there are far more photos of other "Aaron Tay"s, so when such systems are given a prompt with the input "Aaron Tay", they are unlikely to produce an image close to me (though they will likely generate Chinese facial features). For celebrities, it pretty much nails the likeness, of course, since almost all the training images will be of them (try, say, <a href="https://haveibeentrained.com/?search_text=Angelina%20jolie">Angelina Jolie</a>).</div><div><br /></div><div>DALL-E 3, when used via ChatGPT Plus, also has a host of other restrictions. If <a href="https://twitter.com/aarontay/status/1713972855384490301">the system prompt here is accurate</a>, these include instructing the model to:</div><div><br /></div><div><ul style="text-align: left;"><li>not "create images in the style of artists whose last work was created within the last 100 years"<br /></li><li>not "create any imagery that would be offensive."</li><li>"Silently modify descriptions that include names or hints or references of specific people or celebritie by carefully selecting a few minimal modifications to substitute references to the people with generic descriptions that don't divulge any information about their identities, except for their genders and physiques"<br /><br /></li></ul></div><div><br /></div><h2 style="text-align: left;">3. It would be able to accept image inputs</h2><div>This is probably the function that is getting the most attention right now. It gives ChatGPT the ability to understand images you upload.</div><div><br /></div><div><a href="https://openai.com/blog/chatgpt-can-now-see-hear-and-speak">OpenAI says</a></div><div><blockquote>Image understanding is powered by multimodal GPT-3.5 and GPT-4. These models apply their language reasoning skills to a wide range of images, such as photographs, screenshots, and documents containing both text and images.</blockquote></div><div>People are really impressed by this capability, and so am I. For one, it can read screenshots of English text, figures and tables from papers.
This is clearly going to be useful for interpreting papers.</div><div><br /></div><div>Twitter/X is full of amazing examples; here I try one by uploading a visual abstract created by my colleagues.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi20GdyIcHAtp6wK2OaFFQgv9Ur-FmDo9PeJPHohaTDd6BK59oWZmOsq97-Y1fBi7btDMLe8elsWBctcx3DBYZ7JPNuWFRQZeUeXbAHE8RPI2J7hiiPPGE14pzfRTS77ZRa6gedd8UQp197YiL4h2aHxoe4-fI3exfrQJm-rJM3yffT4sli6M6WIil8iStV/s1211/dalle-visualabstract.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="659" data-original-width="1211" height="348" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi20GdyIcHAtp6wK2OaFFQgv9Ur-FmDo9PeJPHohaTDd6BK59oWZmOsq97-Y1fBi7btDMLe8elsWBctcx3DBYZ7JPNuWFRQZeUeXbAHE8RPI2J7hiiPPGE14pzfRTS77ZRa6gedd8UQp197YiL4h2aHxoe4-fI3exfrQJm-rJM3yffT4sli6M6WIil8iStV/w640-h348/dalle-visualabstract.png" width="640" /></a></div><br /><div>This is what ChatGPT with vision sees.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6dC-bcU-OcXIWe2S6gdiPq5sW1OM64zcODINDTXePSuMFZNVlcN6APTUxleAHhBIfVfckesLMrY0fuERUUoRQUuS1QCwedrx9Tp8e5h0VegO_2ELA-j9IB8FS232C1lk1Ag96VWoEYbDchi1m_ZFOqRxyQUhTHIhiJbD7gvRB8kD6nL0tjed03jBZcaKK/s1010/dalle-visualabstract-2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="1010" data-original-width="678" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj6dC-bcU-OcXIWe2S6gdiPq5sW1OM64zcODINDTXePSuMFZNVlcN6APTUxleAHhBIfVfckesLMrY0fuERUUoRQUuS1QCwedrx9Tp8e5h0VegO_2ELA-j9IB8FS232C1lk1Ag96VWoEYbDchi1m_ZFOqRxyQUhTHIhiJbD7gvRB8kD6nL0tjed03jBZcaKK/w430-h640/dalle-visualabstract-2.png" width="430" /></a></div><br /><div>I tried a <a href="https://twitter.com/aarontay/status/1713865247273205969">few more examples of visual abstracts</a>, and while it is not perfect, it is still impressive.</div><div><br /></div>
<blockquote class="twitter-tweet" data-conversation="none"><p dir="ltr" lang="en">Okay I tried this visual abstract on GPT4 vision. It has trouble understanding that there is an icon for RCT and Population based Cohorts. It thinks that double arrow icon means < (less than) , (1) <a href="https://t.co/K9b6n5r4M1">pic.twitter.com/K9b6n5r4M1</a></p>— Aaron Tay (@aarontay) <a href="https://twitter.com/aarontay/status/1713865247273205969?ref_src=twsrc%5Etfw">October 16, 2023</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<blockquote class="twitter-tweet" data-conversation="none"><p dir="ltr" lang="en">I asked GPT4 for the 95% CI for intubation, at first it correctly refused saying it isn't labelled. I ask it to estimate anyway and it basically makes things up (multiple tries) - (2) <a href="https://t.co/4ZTAvAvyFQ">pic.twitter.com/4ZTAvAvyFQ</a></p>— Aaron Tay (@aarontay) <a href="https://twitter.com/aarontay/status/1713866127108112584?ref_src=twsrc%5Etfw">October 16, 2023</a></blockquote> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script>
<div><br /></div><div><br /></div><h2 style="text-align: left;">A true multi-model Large Language Model will be amazing</h2><div>Despite all the amazing new capabilities, I was still somewhat disappointed because each of these capabilities can only be used seperately.</div><div><br /></div><div>By default, when you start a prompt in ChatGPT you must choose from the following options</div><div><br /></div><div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEj1nLHu_ccPy2bMNLvoTmiUDKWl8ZOzaXZ61fUQmir49EePaD6h5aZ-pqyYBMYajhWxSq5-4SNpbOGaCvN_m7ZI4tM-h-ntQHt8f1T79Qa8JAAlxMKFtsS3UobZhm2RQyWNEN7J45_pEuKEcgCDX5bhrh4r1-xJal6ogHb3Cd3dHCrupwmsXd0KjQ5Ivh2Y" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="320" data-original-width="193" height="240" src="https://blogger.googleusercontent.com/img/a/AVvXsEj1nLHu_ccPy2bMNLvoTmiUDKWl8ZOzaXZ61fUQmir49EePaD6h5aZ-pqyYBMYajhWxSq5-4SNpbOGaCvN_m7ZI4tM-h-ntQHt8f1T79Qa8JAAlxMKFtsS3UobZhm2RQyWNEN7J45_pEuKEcgCDX5bhrh4r1-xJal6ogHb3Cd3dHCrupwmsXd0KjQ5Ivh2Y" width="145" /></a></div><div class="separator" style="clear: both; text-align: center;"><br /></div>Each of these modes are mutually exclusive. There's isn't a mode you can choose for uploading images, but it only appears when you choose "Default"</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVeaMcOkd9lejIMMJUFT6xJS36PYymfQr0-cuwtPb1onnd4OuUTMHiaQUmA13hCIV7vImlEMAcoJZe4pvf4hyphenhyphenKOBQQU537Fu8v4tCdqd_cnTS5J27jmMAD91z5i5j-9OQH441WuRN0h6NxZHjjlQXOcgnLWaD7dQDgXv11NzxpuzAOi9qd_R2KmdIFEXlU/s575/chatgpt-vision.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="271" data-original-width="575" height="302" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjVeaMcOkd9lejIMMJUFT6xJS36PYymfQr0-cuwtPb1onnd4OuUTMHiaQUmA13hCIV7vImlEMAcoJZe4pvf4hyphenhyphenKOBQQU537Fu8v4tCdqd_cnTS5J27jmMAD91z5i5j-9OQH441WuRN0h6NxZHjjlQXOcgnLWaD7dQDgXv11NzxpuzAOi9qd_R2KmdIFEXlU/w640-h302/chatgpt-vision.png" width="640" /></a></div><div style="text-align: center;">Option to upload image only appears if you select default GPT-4 mode. Any other mode this option is not there</div><div><br /></div><div>This is such a shame, because I was looking forward to combining ChatGPT's new vision capabilities with the existing "Advanced Data Analysis" capabilities (formerly known as Code Interpreter).</div><div><br /></div><div>As <a href="https://musingsaboutlibrarianship.blogspot.com/2023/08/gpt4code-interpreter-playing-electronic.html">I covered in past post</a> , this is the mode that adds a "Code Sandbox" for GPT.</div><div><br /></div><div><div>OpenAI <a href="https://openai.com/blog/chatgpt-plugins#code-interpreter">describes it as</a></div><div><blockquote>We provide our models with a working Python interpreter in a sandboxed, firewalled execution environment, along with some ephemeral disk space. Code run by our interpreter plugin is evaluated in a persistent session that is alive for the duration of a chat conversation (with an upper-bound timeout) and subsequent calls can build on top of each other. We support uploading files to the current conversation workspace and downloading the results of your work.</blockquote></div></div><div>This mode also allows you to upload all types of files including csv, text, Python scripts, PDFs. 
<div>In the earlier blog post, I uploaded <a href="https://researchdata.smu.edu.sg/articles/dataset/Data_and_code_for_Does_bedtime_music_listening_improve_subjective_sleep_quality_and_next-morning_well-being_in_young_adults/21252285">a zipped file of a research dataset that was already deposited into our data repository</a> and <a href="https://musingsaboutlibrarianship.blogspot.com/2023/08/gpt4code-interpreter-playing-electronic.html#rdm">asked it to analyse the files</a>.</div><div><br /></div><div>In particular, I asked it to interpret the uploaded files and suggest ways to improve the quality of the deposit.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC-tSrGTfRWvukRqTI_iqnCN2gRGIbSihTxxbB6gTodQth2vHUiAu6jd1S9p81kcXiTztcWgTimmGUSNJLMdZu5-h4IFJa-nF2yf8VpHi_tn6xvT77hEF_g-Sx3ZH0mNkPP8BMVP4kn5S3mBVl-XzsM6ynh5KNlzwjKH9Q2h_LLAPUaZ8mFjeiJbYGaBTt/s800/unsub-analysis7.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="669" data-original-width="800" height="536" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgC-tSrGTfRWvukRqTI_iqnCN2gRGIbSihTxxbB6gTodQth2vHUiAu6jd1S9p81kcXiTztcWgTimmGUSNJLMdZu5-h4IFJa-nF2yf8VpHi_tn6xvT77hEF_g-Sx3ZH0mNkPP8BMVP4kn5S3mBVl-XzsM6ynh5KNlzwjKH9Q2h_LLAPUaZ8mFjeiJbYGaBTt/w640-h536/unsub-analysis7.png" width="640" /></a></div><div><br /></div><div>Using its Python interpreter, it could load the CSV codebooks and data files and handle them pretty well.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKsfsfcbnvgEuXk2DFcjSPqCp6vywouCrSwF_iyyA1RCPWiVwhtcNL_xre0a0uDCY4RoRDhG0uKMvPKaOaVQQzhuhpSlsdomSpzZ6nQQfrPvIEWhEcRnHeXzZDmmeceuMf8JDKC11PTdG8se2uXvipl1ShiOPH-6RiXKeqceHLu4lYWl_HwAgiXpWBUeK-/s616/unsub-analysis2.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="539" data-original-width="616" height="280" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgKsfsfcbnvgEuXk2DFcjSPqCp6vywouCrSwF_iyyA1RCPWiVwhtcNL_xre0a0uDCY4RoRDhG0uKMvPKaOaVQQzhuhpSlsdomSpzZ6nQQfrPvIEWhEcRnHeXzZDmmeceuMf8JDKC11PTdG8se2uXvipl1ShiOPH-6RiXKeqceHLu4lYWl_HwAgiXpWBUeK-/s320/unsub-analysis2.png" width="320" /></a></div><div><br /></div>However, when it comes to images (figures) or PDFs, it has to use Python libraries to try to extract the text, with uneven results, to 'see' what is in there.<div><br /></div><div>It seems to me that if the newest image/vision recognition mode were included in this mode, the results would be much better!<br /><div><br /></div><div>Similarly, the ability of ChatGPT Plus to interpret images and its ability to draw images are two separate modes that currently can't be combined.</div><div><br /></div><div>In other words, you can upload an image for it to be described, but you can't then use DALL-E 3 to edit it. This is a surprising weakness, since competitors like Stable Diffusion do allow this.</div><div><br /></div><div>All in all, I expect that when LLMs are truly multimodal and can accept input in different formats (e.g. text, audio, images, video) and generate output in different formats (e.g. text, audio, images, video), we are going to see even more wild use cases.</div><div><br /></div>
<div>Imagine uploading a zipped file of different data formats and asking it to analyse, amend and generate analyses in anything from text to images to videos! The possibilities are endless, from writing/editing a paper to creating a short summary of a paper as a poster, video, etc.</div><div><br /></div><div>True <a href="https://www.searchenginejournal.com/google-gemini-what-we-know-so-far/496494/">multimodal language models are rumored to be coming in Google's next-generation large language model, Gemini (due 3Q/4Q 2023)</a>, so we shall see what the future brings...</div><div><br /></div><div>Also see the latest GPT-4 update (30 October 2023):</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLGJsqKkqB-ZYN7uVnmZif_Vyyv42oO2zRs_r_ZANjXVSmgUFO1hEwM26mgl6L6GFJ1wLCE4O6qTEHjk1umGJDYDt0nydJ_vsA_X-w9fmM_vHGVb-IJmgnOZ1RrU9a3-aUOgnvCGAuk79OeOV2e_vZrTx0H5YjYVaLsOZUirZnXEQNTxDR5a4mwSO5TEmP/s553/GPT4-all.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="369" data-original-width="553" height="428" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLGJsqKkqB-ZYN7uVnmZif_Vyyv42oO2zRs_r_ZANjXVSmgUFO1hEwM26mgl6L6GFJ1wLCE4O6qTEHjk1umGJDYDt0nydJ_vsA_X-w9fmM_vHGVb-IJmgnOZ1RrU9a3-aUOgnvCGAuk79OeOV2e_vZrTx0H5YjYVaLsOZUirZnXEQNTxDR5a4mwSO5TEmP/w640-h428/GPT4-all.jpg" width="640" /></a></div><br /><div><br /></div> <script async="" charset="utf-8" src="https://platform.twitter.com/widgets.js"></script></div>Aaron Tayhttp://www.blogger.com/profile/02750645621492448678noreply@blogger.com0