Searching the cloud

October 2018

Cloud computing is becoming increasingly popular. But what are its implications for search? Jochen L Leidner MBCS, Director of Research at Thomson Reuters, reports.

Companies around the world are reducing the cost of internal IT infrastructure by giving up their data centres in favour of cloud computing (BCS 2012), i.e. by embracing external servers for storage and computation.

NIST, the US National Institute of Standards and Technology, defines cloud computing as ‘a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction’.

Amazon, Google, IBM and Microsoft are some examples of big cloud vendors that rent out their hardware assets (servers, disks, networks). At the level of applications, there is also a trend to move away from locally-installed software on desktop PCs towards cloud-hosted web applications (software-as-a-service or SaaS), to save by eliminating the life-cycle of installing, maintaining, upgrading and sun-setting applications. SAP, Salesforce, Oracle, ServiceNow and Workday are key SaaS vendors.

Both lower-level storage and computing resources and higher-level application resources benefit from cloud migrations, and not just in terms of cost: security can also improve, because their scale allows cloud vendors to employ top-grade security staff in large numbers.

Elasticity is another huge cloud advantage: the number of servers or application users can be rapidly increased or decreased depending on dynamic demand, since cloud vendor infrastructure is always available in abundance - it is shared with all their customers.

The cloud abstraction assumes that an infinite supply of resources is ‘out there’, which can be rented and released at will, either manually by admins or automatically via APIs.

Before the advent of the cloud, many companies already had heterogeneous networked hardware and software environments in place, and often struggled with data management (i.e., organising their data assets) and knowledge management (i.e., organising their knowledge).

The transition away from self-managed, on-premises servers and company-owned data centres happened gradually, and many organisations went through a phase of relying on internal private clouds – sharing resources across departments, but only internally.

Private clouds have the advantage that all assets remain confined within the organisation’s walls, legally and physically, which simplifies governance and security. However, private clouds inherit the disadvantages of both worlds: they neither provide the elasticity and scale of public clouds, nor do they benefit from the scale effects of cloud providers: vendors like Amazon or Google can hire the most expensive security experts, as this cost is distributed across millions of servers, not just hundreds.

Private and public clouds add complexity to these pre-existing challenges in at least three ways. First, network access introduces delays. Second, in a cloud scenario, data access crosses organisational boundaries, which has security and management implications.

Third, the cloud’s elasticity means that cloud storage is commissioned and de-commissioned rapidly, so the search index must be updated incrementally to keep search results relevant.
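As a rough illustration of what an incremental index update involves, the following minimal Python sketch keeps a toy inverted index that can add and remove individual documents without being rebuilt from scratch. The class name, method names and file paths are hypothetical and chosen for illustration only; a production enterprise search engine would do far more (tokenisation, ranking, persistence).

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated document by document."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.doc_terms = {}               # document id -> set of its terms

    def add(self, doc_id, text):
        # Re-adding a known document first drops its old entry.
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        # Called when a file is deleted or its storage is de-commissioned.
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        # Return the ids of documents that contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


# Hypothetical example: two files on two cloud storage buckets.
index = IncrementalIndex()
index.add("bucket-a/report.txt", "quarterly sales report")
index.add("bucket-b/memo.txt", "sales memo")
print(sorted(index.search("sales")))  # both files match

index.remove("bucket-b/memo.txt")     # bucket-b is de-commissioned
print(sorted(index.search("sales")))  # only the report is still findable
```

The point of the sketch is that each commissioning or de-commissioning event touches only the affected documents' entries, rather than forcing a full re-crawl.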

The findability of an organisation’s information in the cloud is crucial, as it is a large part of effective knowledge management. Anecdotally, about 20 per cent of enterprise employees’ time is spent searching for information, so information needs should not be met less effectively in the cloud than before. Yet it is challenging, to say the least, to keep all documents findable when they are spread across potentially multiple cloud vendors’ millions of servers, or buried inside SaaS applications that may not expose their data to the indexing servers, which act as the librarians that keep the library catalogue of keywords up to date.

Companies typically employ web-based content management systems (CMS), and these have their own search functions, but a lot of the information may no longer reside in the – often centralised – CMS. So, a new kind of enterprise search may be required to address this. While we cannot go into detail here or attempt a vendor comparison for lack of space, the list below contains some questions that most organisations attempting a new implementation of an enterprise search product could try to answer if they want to ensure findability in the organisation’s cloud.

Cloud search: Some implementation questions

  1. What document types are supported by the index crawlers?
  2. Are indexing and retrieval processes federated?
  3. What kinds of database and system connectors are supported by the index crawlers (Oracle RDBMS, SAP R/3 ERP, PostgreSQL/MongoDB/AWS S3, Microsoft Exchange, ...)?
  4. Is the enterprise search architecture aligned with my organisation’s structure?
  5. What is the average response time for a set of typical queries under typical system load?
  6. What is the maximum index size (# unique words, # unique documents)?
  7. What is the cost of implementing a particular enterprise search application? How is the cost structured (e.g. by user or by CPU)?
  8. What additional network traffic will implementing a particular solution create on the corporate network?
  9. How is the investment into a solution protected? For example, is there a clause that the source code of the system will be provided if the vendor decides to sun-set the product?
  10. What is the security model (permissions for documents, users, groups)? Does the system support search over encrypted content (e.g. homomorphic encryption)?
  11. How are internal and external cloud resources communicated to the index crawler? Whose responsibility is it to trigger life-cycle state updates, and what API can be used?
  12. What are meaningful and safe default access privileges for indexed cloud data so they can be found using universal enterprise search queries?
  13. What worst-case time guarantees are given in terms of time from storing a new file on an external cloud storage node to that file’s content being available in a search?
  14. What policies are available to control index freshness depending on known data volatility (i.e. frequency of change)?
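To make the last question more concrete, here is a minimal Python sketch of one conceivable freshness policy: re-crawl each source roughly once per expected change, within fixed bounds. The policy itself, the `Source` type, the function name and the bounds (15 minutes and one week) are assumptions made for illustration, not features of any particular product.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    changes_per_day: float  # observed or estimated volatility of the source


def recrawl_interval_hours(source, min_hours=0.25, max_hours=168.0):
    """Suggest a re-crawl interval: about once per expected change, clamped."""
    if source.changes_per_day <= 0:
        return max_hours  # static data: check no more than weekly
    expected_gap = 24.0 / source.changes_per_day
    return max(min_hours, min(max_hours, expected_gap))


# Hypothetical sources with different volatility.
for src in [Source("ticket system", 1000), Source("wiki", 4), Source("archive", 0)]:
    print(src.name, recrawl_interval_hours(src))
```

Volatile sources are re-indexed often and static ones rarely, which is one way of trading index freshness against the additional network traffic raised in question 8.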

One of the main risks of cloud use is the improper management of access permissions. Public cloud resources like Amazon S3 storage buckets reside outside the organisation’s firewall, so accidentally granting general read access means world-wide read access. Software bugs can therefore have disastrous consequences, from leaking trade secrets to violating laws and regulations by disclosing specially protected personal information. An overly defensive approach, on the other hand, leads to cloud silos that cannot be accessed by the organisation’s cloud search.
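One common way to reconcile the two concerns is to index content together with its access permissions and to filter search results against the permissions of the person searching. The Python sketch below illustrates the idea with a deny-by-default rule; the `IndexedDoc` type, the group names and the storage paths are hypothetical, and real systems would take permissions from the organisation’s directory service rather than hard-coding them.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    doc_id: str
    text: str
    # Groups allowed to see this document; empty means nobody (safe default).
    allowed_groups: frozenset = field(default_factory=frozenset)


def visible_results(docs, query, user_groups):
    """Return ids of matching documents that the user is permitted to see."""
    user_groups = set(user_groups)
    hits = []
    for doc in docs:
        if query.lower() not in doc.text.lower():
            continue
        # Deny by default: a document without any permitted group is never shown.
        if doc.allowed_groups & user_groups:
            hits.append(doc.doc_id)
    return hits


# Hypothetical documents on cloud storage.
docs = [
    IndexedDoc("s3://hr/salaries.csv", "salary bands 2018", frozenset({"hr"})),
    IndexedDoc("s3://wiki/pay.txt", "salary review process", frozenset({"hr", "staff"})),
    IndexedDoc("s3://tmp/draft.txt", "salary draft"),  # no permissions recorded
]
print(visible_results(docs, "salary", {"staff"}))  # only the wiki page
print(visible_results(docs, "salary", {"hr"}))     # the two HR-visible files
```

Under this assumption, the content of every document can be indexed, so nothing is lost in a silo, while a document whose permissions are missing or were never recorded stays invisible instead of leaking.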

The published literature on cloud indexing and retrieval is still nascent: for example, Leidner (2018) surveys some early work, while Liu et al. (2012) describe methods to efficiently process ranked queries in cloud environments. Shokouhi and Si (2011) survey recent research in federated search. Uddin et al. (2013) discuss the fact that timely access, in the limit, means real-time search in the cloud: the immediate availability of data in search results with hard guarantees. There is an opportunity for the IR research community to come together to conduct joint benchmark studies across research groups to find out what works best.

This article opens up the discussion of what the migration to the cloud means for search in an organisation. More research in distributed computing will be needed, in particular regarding federated indexing, retrieval, caching and replication. A challenge will be to strike a balance between ensuring that indices are complete across local and cloud machines, that permissions are respected, and that the user’s search experience is fast and effective.

Further reading
  1. ‘Cloud Computing: Moving IT Out Of The Office’. London, UK: BCS - The Chartered Institute for IT. (2012)
  2. Leidner, Jochen L. (2018) ‘Information Retrieval in the Cloud’ Pre-print arXiv:1807.00257 [cs.IR] (accessed 2018-07-03).
  3. Liu, Q., C. C. Tan, J. Wu and G. Wang (2012) ‘Efficient information retrieval for ranked queries in cost-effective cloud environments,’ Proceedings IEEE INFOCOM 2012, Orlando, FL, USA.
  4. Shokouhi, Milad and Luo Si (2011) ‘Federated Search’. (FTIR 5(1)) Boston, MA, USA: Now Publishing.
  5. Uddin, Misbah, Skinner, Amy, Stadler, Rolf and Clemm, Alexander (2013) ‘Real-Time Search in Clouds’, Proceedings of the 2013 IFIP/IEEE International Symposium on Integrated Network Management (INM 2013), New York City, NY, USA: IEEE.
 
