Search Solutions 2022 - Tutorials
Search Solutions is the BCS Information Retrieval specialist group's annual event focused on practitioner issues in the area of search and information retrieval.
Tutorials are either full day (5-6 hours including breaks and lunch) or half day (2-3 hours including breaks). The tutorials will take place on Tuesday 22nd November 2022 at the BCS offices in London and/or online, depending on the situation nearer the time. We encourage in-person tutorials at the BCS offices if possible.
IR From Bag-of-words to BERT and Beyond through Practical Experiments
- Sean MacAvaney, University of Glasgow, Email: email@example.com
- Craig Macdonald, University of Glasgow, Email: firstname.lastname@example.org
- Nicola Tonellotto, University of Pisa, Email: email@example.com
In this tutorial, you will build up your knowledge of information retrieval from the basics up to the latest BERT-based techniques. Hands-on exercises will give you practical experience using these techniques. By the end of the tutorial, you will be familiar with the latest techniques, including neural language models for re-ranking, learned sparse retrieval, and dense retrieval.
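As a rough illustration of the bag-of-words starting point mentioned above, the following is a minimal sketch of BM25-style scoring in plain Python. It is not taken from the tutorial materials: the toy corpus, the whitespace tokenizer, the function names and the parameter values (k1 = 1.2, b = 0.75) are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, for illustration only.
DOCS = {
    "d1": "neural models for passage ranking",
    "d2": "bag of words retrieval with inverted index",
    "d3": "dense retrieval and sparse retrieval compared",
}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a BM25-style formula."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(terms) for terms in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalisation dampens the term frequency of long documents.
            norm = tf[term] + k1 * (1 - b + b * len(terms) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


def rank(query, docs):
    """Return document ids ordered from highest to lowest score."""
    scores = bm25_scores(query, docs)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

Calling `rank("sparse retrieval", DOCS)` orders the toy documents by lexical match only. Neural re-ranking, learned sparse retrieval and dense retrieval go beyond this kind of exact term matching.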
10:00 - 11:00 Part 1 Presentation (a): Indexing, Retrieval, Evaluation
11:00 - 11:15 Morning break
11:15 - 11:45 Part 1 Presentation (b): Learning-to-rank
11:45 - 12:15 Part 1 Lab
12:15 - 13:00 Lunch break
13:00 - 14:00 Part 2 Presentation: Neural re-ranking
14:00 - 14:30 Part 2 Lab
14:30 - 14:45 Afternoon break
14:45 - 15:45 Part 3 Presentation: Learned sparse retrieval, Dense retrieval
15:45 - 16:15 Part 3 Lab
Attendees are required to bring their own computers/laptops for the lab component. All materials (e.g., datasets, models) are automatically downloaded. The preferred platform for running the labs is Google Colab, though participants may also run the exercises locally if they prefer. Materials (slides and Colab notebooks) will be accessible to attendees in a public GitHub repository*.
* Examples from previous iterations of the tutorial: https://github.com/terrier-org/ecir2021tutorial, https://github.com/terrier-org/cikm2021tutorial
Approaching Neural Search with Apache Solr and Open-source technologies
Alessandro Benedetti, CEO @ Sease Ltd, Apache Lucene/Solr Committer, Apache Solr PMC Member, Email: firstname.lastname@example.org
Please join us to explore this exciting new Apache Solr feature and learn how you can leverage it to improve your search experience!
9:00 - 9:20 - Introduction to Semantic Search Problems (vocabulary mismatch problem, semantic similarity)
9:20 - 9:40 - From Text to Vectors (Sparse vs Dense vector representation)
9:40 - 10:10 - How Approximate Nearest Neighbor (ANN) approaches work, with a focus on Hierarchical Navigable Small World Graph (HNSW)
10:10 - 10:40 - How the Apache Lucene implementation works
10:40 - 11:10 - How the Apache Solr implementation works, with the new field type and query parser introduced
11:10 - 11:30 - Break
11:30 - 12:00 - How to run KNN queries and how to use them to rerank a first-stage pass
12:00 - 12:35 - How to generate vectors from text and integrate large language models with Apache Solr
12:35 - 13:05 - Limitations and how to mitigate them
13:05 - 13:20 - Future Work
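To make the KNN and reranking items in the schedule above more concrete, here is a minimal sketch of exact (brute-force) nearest-neighbour search over dense vectors, followed by reranking of a first-stage candidate list. It is written in plain Python and does not reflect the Apache Lucene or Apache Solr implementation: ANN methods such as HNSW approximate this exhaustive comparison to avoid scoring every vector. The toy vectors and all function names are assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn(query_vec, vectors, k):
    """Exact k-nearest-neighbour search: compare the query to every vector."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def rerank(candidates, query_vec, vectors):
    """Reorder first-stage candidates by vector similarity to the query."""
    return sorted(
        candidates,
        key=lambda doc_id: cosine(query_vec, vectors[doc_id]),
        reverse=True,
    )


# Hypothetical two-dimensional embeddings, for illustration only.
VECTORS = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
QUERY = [1.0, 0.0]

top_two = knn(QUERY, VECTORS, k=2)
reordered = rerank(["c", "b", "a"], QUERY, VECTORS)
```

In the reranking case, only the documents returned by a first-stage (for example lexical) query are scored against the query vector, which keeps the number of vector comparisons small.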
Attendees are required to bring their own computers/laptops for the lab component. Slides and code snippets will be provided.
Simplifying NLP researchers' work with Datafari Open Source
- Julien Massiera, France Labs, Email: email@example.com
- Cedric Ulmer, France Labs, Email: firstname.lastname@example.org
NLP researchers need to manipulate text, and their aim is to find the best way to analyse it. Quite often, however, they first have to deal with the time-consuming task of extracting the text from source documents. This step contributes nothing to their research, but it is necessary. Then, if they work with machine learning algorithms for instance, they need to test those algorithms on actual data, which is again time-consuming. This is where Datafari comes into play. Datafari is among the few available open-source Enterprise Search solutions. It covers the necessary steps, from crawling document sources to indexing and searching, including text extraction. With it, attendees, in particular NLP researchers, will have an open-source toolbox that simplifies their work and lets them focus on their actual research.
14:00 - 14:30 Understanding Datafari, its architecture and its components
14:30 - 15:00 Installing Datafari
15:00 - 16:00 Going through use case A: using Datafari to easily extract text from multiple sources and multiple formats, and retrieving the output as either raw text files or within a Solr search index
16:00 - 16:20 Break
16:20 - 17:20 Going through use case B: using Datafari to add an NLP step in the documents crawling pipeline and retrieving the output entities as a new field in a Solr search index.
17:20 - 17:45 Wrap up and questions
Attendees are required to bring their own computers/laptops for the lab component.
Option 1: You can do the full tutorial on your own laptop if you are able to run a Linux OS (either directly on the machine or through a VM or a Docker container). Your machine must have at least 12 GB of RAM dedicated to Datafari, a CPU of at least 1 GHz, and at least 20 GB of disk space available, if possible on an SSD.
Option 2: You can do the tutorial on a remote Linux system that France Labs will be hosting, using your laptop to connect to it. For this, you will need an internet connection and the ability to connect via SSH to a remote system (natively included in Linux systems, requiring for instance PuTTY or MobaXterm on Windows systems).
Diverse Approaches to Systematic Searching
Dr Farhad Shokraneh
Institute of Health Informatics, University College London, London, UK
Centre for Neuromuscular Diseases, National Hospital for Neurology and Neurosurgery, UCLH, London, UK
King's Technology Evaluation Centre, King's College London, London, UK
Division of Psychiatry and Applied Psychology, University of Nottingham, UK
School of Medicine, University of Central Lancashire, Preston, UK
Systematic Review Consultants, Nottingham, UK
Isla Kuhn - Head of Medical Library Services at the University of Cambridge
Searching for literature review purposes could follow different steps, methods, and approaches depending on the complexity of the topic, availability of time, human, machine and information resources, and the type and purpose of the review. Since there are several reviews and evidence synthesis types, one size does not fit all, and the searchers need to tailor the search methods and approaches. The focused context of the tutorial will be biomedical and health sciences.
Introducing the search pyramid concept for the first time, this tutorial will show the reverse progress of search systems to simplify the search string development. Furthermore, the tutorial will take a deeper dive into the approaches that users take to start, continue and finish their searches, in order to classify the existing approaches, including, but not limited to, minimalist vs maximalist approaches, translational reductions, internal and external validation, pre-search, in-search, and post-search filtering, structural performance tests, peer-review, historical methods, and scoping methods. Each of these methods will be discussed to reveal its best use cases, advantages, disadvantages, and the road toward its future development. The participants will have a chance to practice most of the approaches during the tutorial. There is no prerequisite for this tutorial; anyone can benefit from the content.
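As a small illustration of what search string development can involve, the sketch below assembles a Boolean search string from groups of synonyms, combining terms within a concept with OR and combining concepts with AND. This is a common general pattern, not the search pyramid or any specific method presented in the tutorial. The function name, the example terms and the quoting rule for multi-word phrases are assumptions, and real databases each have their own syntax.

```python
def build_search_string(concepts):
    """Join synonyms with OR inside each concept, and concepts with AND.

    `concepts` is a list of lists of terms; multi-word terms are quoted
    so that they are treated as phrases.
    """
    blocks = []
    for synonyms in concepts:
        quoted = [f'"{term}"' if " " in term else term for term in synonyms]
        blocks.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(blocks)


# Hypothetical example concepts, for illustration only.
query = build_search_string(
    [
        ["stroke", "cerebrovascular accident"],
        ["aspirin"],
    ]
)
```

Adding synonyms to a concept broadens the search (a more maximalist direction), while adding further concepts narrows it.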
10:00 – 10:20 Introduction to systematic reviews, evidence synthesis, and systematic search
10:20 – 10:45 Typology of evidence synthesis
10:45 – 11:00 Questions and Short Break
11:00 – 11:30 The process of developing the search methods
11:30 – 11:50 The steps to developing a search strategy
11:50 – 12:00 Short Break
12:00 – 13:00 Practice 1: Developing a search strategy
13:00 – 14:00 Lunch Break
14:00 – 14:20 Analysing problem scenarios to develop search methods
14:20 – 14:40 Information needs that trigger a specific approach to systematic searching
14:40 – 14:50 Search pyramid
14:50 – 15:00 Short Break
15:00 – 15:30 Diversity of search functions and tools in databases
15:30 – 16:00 Practice 2: Matching the scenario to the approaches
16:00 – 16:30 Discussions and Closing
Attendees are required to bring their own computers/laptops for the lab component. The practice materials and slides will be printed and presented to the participants.