The Document Warehousing Concepts

Introduction

Modern business intelligence practices call for collecting and using large quantities of data about customers, suppliers, employees, and society in general. This data arrives as structured records or reports, images, sounds, texts, and many other forms. Hence, a challenge for information technology (IT) specialists is to create and manage systems that allow business analysts to benefit from this sea of data. This paper analyzes the main concepts involved in document warehousing, including the definition of this process, the strategies used for document warehousing, and how analytics can be performed on unstructured data.

Warehousing and the Difference between Data and Document Warehousing

It is natural for businesses to collect and store data about the processes related to their work. A data warehouse is a relational database created for analytical purposes (Bouaziz et al., 2019). If data is stored in an operational database or as a flat file, it is difficult to analyze and visualize, and therefore to draw insights from, which is why data warehouses are created. According to Bouaziz et al. (2019), the benefit of a data warehouse is the ability to separate the transaction workload from the analytical workload. Thus, the purpose of a data warehouse is to support analysis that would be impractical to run directly against operational databases. Srivastava et al. (2013) note that a further benefit of data warehousing is the ability to leverage different levels of granularity and different dimensions of data. Over the last several years, businesses have begun to collect information related to their transactions in the form of multimedia files, such as texts, photos, and videos, which are more difficult to classify and analyze than structured data.

In contrast to data warehousing, a document warehouse is a framework for the storage and use of unstructured data, such as texts or multimedia (Azabou et al., 2020). According to Azabou et al. (2020), document warehousing is the next wave of data warehousing concepts, which began due to the increased demand for storing and processing multimedia information such as pictures, sounds, and videos. However, Azabou et al. (2020) also note that the lines between structured and unstructured data have blurred in recent years, so there are only a few differences to consider when managing warehouses for multimedia or structured data. Apart from multimedia, document warehouses also store textual data, such as customer complaints or reports. These texts cannot be analyzed with conventional data warehousing methods, yet they contain a large amount of valuable information (Azabou et al., 2020). Although it is beneficial for business intelligence specialists to analyze this data for new insights, the nature of unstructured information poses a wide range of challenges in terms of storing, sorting, and utilizing the collected resources.

Since document warehousing is a continuation of data warehousing, some of the latter’s concepts are applicable when working with multimedia files. Azabou et al. (2020) state that prior to the emergence of document warehousing structures, one would have to build a data warehouse using a relational database, utilize a document management system for unstructured data, and create logical links between environments through rows or columns that point to the multimedia data. The need to create additional fields to manage the connections with the unstructured data made the incorporation of unstructured data complex and challenging; for example, some software updates could break the links between systems.

Currently, organizations need data warehousing systems that can store information other than rows and columns, since multimedia has become an integral part of life. For example, Azabou et al. (2020) argue that fifteen years ago, medical records would contain data in the form of texts or numbers, while now they are integrated with images, X-rays, voice memos, and other multimedia. Hence, document warehousing, despite being a new concept, is likely to be in high demand in the future. Similarly, Srivastava et al. (2013) note an increased tendency towards searching and storing multimedia information on web resources.

Basics of Document Warehousing

Current multimedia storage systems remain complex, which makes data retrieval difficult and time-consuming (Srivastava et al., 2013). Hence, there is a delay when accessing multimedia information from data warehouses, which several researchers and practitioners have attempted to address. Document warehousing aims to organize and store data in a way that allows for efficient and insightful retrieval (Azabou et al., 2020). More broadly, warehousing can be viewed as a process of transforming data into information.

When examining how flat files or databases are transformed into data warehouses, it is important to look at the Staging Area with its Staging Database. A Staging Database is a temporary store of information that is transferred from flat files and databases through the process called extract-transform-load, or ETL (Azabou et al., 2020). ETL is also used when the information is transferred from the Staging Area into the data warehouse. A data warehouse holds two types of data: raw data, that is, the rows and columns containing the actual information, and metadata, which describes the raw data and its characteristics. Arguably, metadata is the key feature that differentiates a data warehouse from a database, since it records the characteristics of the information being stored. Data marts are similar to data warehouses; however, they give specific user groups access to some parts of the data and restrict access to the rest (Ferro et al., 2019). Hence, data marts divide the information in a data warehouse into different structures, each accessible to a different set of users.
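The flow from flat file to staging area to warehouse can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory SQLite database; the table names, column names, and sample rows are hypothetical and do not come from the cited sources. A view stands in for a data mart here, whereas a production system would enforce access through the database's permission mechanism.

```python
import csv
import io
import sqlite3

# Hypothetical flat-file source; the columns are illustrative only.
FLAT_FILE = """order_id,region,amount
1,north,120.50
2,south,80.00
3,north,42.25
"""


def run_etl(flat_file: str) -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Staging table: temporary, holds rows exactly as they were extracted.
    cur.execute("CREATE TABLE staging (order_id TEXT, region TEXT, amount TEXT)")
    # Warehouse: raw data plus a metadata table that describes it.
    cur.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
    cur.execute("CREATE TABLE metadata (column_name TEXT, description TEXT)")

    # Extract: read the flat file into the staging area.
    rows = list(csv.DictReader(io.StringIO(flat_file)))
    cur.executemany(
        "INSERT INTO staging VALUES (:order_id, :region, :amount)", rows
    )

    # Transform and load: cast types, normalize text, move into the warehouse.
    cur.execute(
        "INSERT INTO sales "
        "SELECT CAST(order_id AS INTEGER), UPPER(TRIM(region)), "
        "CAST(amount AS REAL) FROM staging"
    )
    cur.executemany(
        "INSERT INTO metadata VALUES (?, ?)",
        [
            ("order_id", "Identifier of the order in the source system"),
            ("region", "Sales region, upper-cased during transformation"),
            ("amount", "Order value as a floating-point number"),
        ],
    )
    # The staging area is temporary, so it is emptied after loading.
    cur.execute("DELETE FROM staging")

    # Data mart stand-in: exposes only one slice of the warehouse.
    cur.execute(
        "CREATE VIEW mart_north AS "
        "SELECT order_id, amount FROM sales WHERE region = 'NORTH'"
    )
    conn.commit()
    return conn


conn = run_etl(FLAT_FILE)
```

Querying `mart_north` returns only the northern orders, which mirrors the idea that each mart serves one group of users.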

With multimedia data, one has to use specific schemes for extracting, designing metadata specifications, creating indexing schemes, visualizing, and aggregating the multimedia information that will be stored in a warehouse (Srivastava et al., 2013). Since this paper focuses on the basic concepts surrounding document warehousing, the remainder of this section describes how documents are selected and stored in a warehouse.

The processes involved in preparing multimedia data for storage and analysis are similar to those used when working with structured data. The initial phase of document warehousing consists of the following stages: extracting, transforming, and loading (Srivastava et al., 2013). In essence, the information that needs to be stored in a warehouse is selected, transformed based on operational needs, and loaded into the warehousing structure. Extraction can be viewed as parsing, since one needs to determine a pattern within the collected data that will help analyze it later. To complete the process, one can use the widely available ETL tools that are already applied in data warehousing.
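The view of extraction as parsing can be illustrated with a small sketch. It assumes, purely for illustration, that customer complaints arrive as plain text with `Ref:` and `Product:` header lines; the layout, the field names, and the dictionary that plays the role of the warehouse are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical complaint texts; the "Ref/Product" layout is an assumption.
RAW_DOCUMENTS = [
    "Ref: C-101\nProduct: Router X\nThe device reboots every hour.",
    "Ref: C-102\nProduct: Modem Y\nConnection drops in the evening.",
]

# The pattern that extraction looks for: "Key: value" header lines.
HEADER = re.compile(r"^(?P<key>Ref|Product):\s*(?P<value>.+)$", re.MULTILINE)


@dataclass
class DocumentRecord:
    ref: str
    product: str
    body: str
    word_count: int


def extract(raw: str) -> dict:
    """Parse header fields and keep the remaining text as the body."""
    fields = {m["key"].lower(): m["value"].strip() for m in HEADER.finditer(raw)}
    fields["body"] = HEADER.sub("", raw).strip()
    return fields


def transform(fields: dict) -> DocumentRecord:
    """Normalize whitespace and casing, and derive a simple descriptor."""
    body = " ".join(fields["body"].split())
    return DocumentRecord(
        ref=fields["ref"],
        product=fields["product"].upper(),
        body=body,
        word_count=len(body.split()),
    )


def load(records, warehouse: dict) -> None:
    """Store each record under its reference number."""
    for record in records:
        warehouse[record.ref] = record


warehouse = {}
load((transform(extract(doc)) for doc in RAW_DOCUMENTS), warehouse)
```

Each stage is a separate function so that a different parser or a different target store could be swapped in without touching the other two.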

OLAP

On-Line Analytical Processing (OLAP) is used when working with structured and unstructured data alike. It is the process of performing online queries on the data from the data warehouse (Bouaziz et al., 2019); Srivastava et al. (2013) similarly define OLAP as a set of analysis techniques used to work with data in a data warehouse. One issue with OLAP design is that it was developed for the analysis of numeric information and is therefore not suitable for working with multimedia without alterations (Srivastava et al., 2013). Dimensions and descriptors obtained through different computation models for the raw data can help enhance the speed of retrieving multimedia information (Srivastava et al., 2013). The challenge with multimedia information extraction is that, although humans find multimedia easy to interpret, computers need an algorithm that addresses the complexity and size of multimedia data to make it usable. Manually searching and extracting multimedia data from a warehouse is not practical at this scale. Hence, the issue with applying OLAP to multimedia data is again linked to the complexity of these types of files.

Notably, Srivastava et al. (2013) criticize OLAP because, although it is commonly used with regular data, the approach limits the ability to analyze and work with multimedia files. When data is stored in a warehouse, several dimensional models, such as star, snowflake, or galaxy, can be used to account for the multidimensionality. Additionally, Srivastava et al. (2013) state that a separate subsystem is required to store the metadata about the multimedia files.
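As an illustration of the star model mentioned above, the sketch below builds one fact table surrounded by two dimension tables, one of which holds metadata about multimedia files. It is only a sketch under assumptions: the table names, columns, and sample values are hypothetical, and SQLite is used for convenience.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension holding descriptive metadata about each media file.
    CREATE TABLE dim_media (
        media_id INTEGER PRIMARY KEY, media_type TEXT, format TEXT
    );
    CREATE TABLE dim_department (dept_id INTEGER PRIMARY KEY, name TEXT);
    -- Fact table at the center of the star: one numeric measure per row.
    CREATE TABLE fact_access (
        media_id INTEGER REFERENCES dim_media(media_id),
        dept_id INTEGER REFERENCES dim_department(dept_id),
        access_count INTEGER
    );
    INSERT INTO dim_media VALUES
        (1, 'image', 'png'), (2, 'video', 'mp4'), (3, 'image', 'jpeg');
    INSERT INTO dim_department VALUES (10, 'Radiology'), (20, 'Cardiology');
    INSERT INTO fact_access VALUES (1, 10, 5), (2, 10, 2), (3, 20, 7), (1, 20, 1);
    """
)


def accesses_by_media_type(connection: sqlite3.Connection) -> dict:
    """Roll the access_count measure up along the media dimension."""
    rows = connection.execute(
        """
        SELECT m.media_type, SUM(f.access_count)
        FROM fact_access AS f
        JOIN dim_media AS m ON m.media_id = f.media_id
        GROUP BY m.media_type
        ORDER BY m.media_type
        """
    ).fetchall()
    return dict(rows)


totals = accesses_by_media_type(conn)
```

A snowflake model would further normalize `dim_media`, for example by moving `format` into its own table, while a galaxy model would add a second fact table that shares these dimensions.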

An OLAP server model can be applied to optimize queries, as it precalculates and aggregates their results. The final stage of using OLAP is the Queries and Reports stage, where end-user software allows analysts to work with the collected multimedia information. This interface contains the tools for analysis and visualization used by business intelligence specialists.
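The idea of precalculating aggregates can be sketched as follows. This is a simplified illustration rather than how any particular OLAP server works: the fact rows, the dimension names, and the `size_mb` measure are hypothetical, and totals are stored for every combination of dimensions so that a query becomes a dictionary lookup.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows describing stored media files.
FACTS = [
    {"media_type": "image", "department": "Radiology", "year": 2019, "size_mb": 12},
    {"media_type": "image", "department": "Radiology", "year": 2020, "size_mb": 8},
    {"media_type": "video", "department": "Radiology", "year": 2020, "size_mb": 300},
    {"media_type": "image", "department": "Cardiology", "year": 2020, "size_mb": 5},
]
DIMENSIONS = ("media_type", "department", "year")


def precompute_cube(facts, dimensions, measure):
    """Sum the measure for every subset of dimensions, ahead of query time."""
    cube = defaultdict(int)
    for fact in facts:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((dim, fact[dim]) for dim in dims)
                cube[key] += fact[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate query by lookup instead of rescanning the facts."""
    key = tuple((dim, filters[dim]) for dim in DIMENSIONS if dim in filters)
    return cube.get(key, 0)


cube = precompute_cube(FACTS, DIMENSIONS, "size_mb")
total_all = query(cube)
total_images = query(cube, media_type="image")
total_radiology_2020 = query(cube, department="Radiology", year=2020)
```

The trade-off is storage for speed: the number of stored aggregates grows quickly with the number of dimensions, which is one reason real systems precompute only selected combinations.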

Concepts for Document Warehousing

Srivastava et al. (2013) describe a principle of document warehousing in which files are stored as XML and queried using the XQuery language. These files are parsed and analyzed through the minimal requirements patterns. Notably, XML is designed for structured data. Hence, this model is an adaptation that is not entirely suitable for working with multimedia. Data retrieval, in this case, uses the description of the media, such as the duration, size, color, and other elements, as well as content analysis, such as the themes and ideas in a specific file (Srivastava et al., 2013). Additionally, Srivastava et al. (2013) recommend leveraging the data types when retrieving and analyzing the multimedia information from a document warehouse.
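Retrieval by media description can be illustrated with a short sketch. Python's standard library has no XQuery engine, so the example swaps in the limited XPath subset of `xml.etree.ElementTree`; the element names, attributes, and sample values are hypothetical and are not taken from Srivastava et al. (2013).

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptors: one XML element per stored media file.
DESCRIPTORS = """
<media_collection>
  <media id="m1" type="video">
    <duration_s>95</duration_s>
    <theme>product demo</theme>
  </media>
  <media id="m2" type="image">
    <theme>x-ray</theme>
  </media>
  <media id="m3" type="video">
    <duration_s>30</duration_s>
    <theme>customer interview</theme>
  </media>
</media_collection>
"""

root = ET.fromstring(DESCRIPTORS)


def find_by_type(collection, media_type):
    """Select media ids by an attribute, using an XPath predicate."""
    matches = collection.findall(f"./media[@type='{media_type}']")
    return [media.get("id") for media in matches]


def find_longer_than(collection, seconds):
    """Select media ids whose duration descriptor exceeds a threshold."""
    result = []
    for media in collection.iter("media"):
        duration = media.findtext("duration_s")
        # Images carry no duration, so they are skipped here.
        if duration is not None and int(duration) > seconds:
            result.append(media.get("id"))
    return result


videos = find_by_type(root, "video")
long_items = find_longer_than(root, 60)
```

The numeric comparison is done in Python because the ElementTree XPath subset does not support it; a full XQuery processor could express the same filter inside the query.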

One aspect to consider when using this approach is that data warehousing employs a computation model with a fixed number of dimensions, where these dimensions are static and each is computed in a single way. However, with multimedia, there are several descriptors that can be computed in different ways, making the computation process more complex. To address this, Srivastava et al. (2013) propose using the Functional Multiversion Multidimensional Model, in which members are computed using various functional versions.
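The notion that one descriptor can be computed in several ways may be easier to see in a toy example. The sketch below is only a loose illustration of that idea and not an implementation of the model the authors describe: the brightness descriptor, the two computing functions, and the pixel values are all hypothetical.

```python
from collections import Counter
from statistics import mean


# Two "functional versions" of the same descriptor: image brightness.
def brightness_by_mean(pixels):
    return round(mean(pixels))


def brightness_by_mode(pixels):
    return Counter(pixels).most_common(1)[0][0]


VERSIONS = {"mean": brightness_by_mean, "mode": brightness_by_mode}


def to_member(value):
    """Map a numeric descriptor to a member of the brightness dimension."""
    return "dark" if value < 128 else "bright"


def dimension_member(pixels, version):
    """Compute the dimension member with the chosen functional version."""
    return to_member(VERSIONS[version](pixels))


# Hypothetical grayscale pixel values for a single image.
PIXELS = [10, 10, 10, 250, 250, 240, 245]

member_mean = dimension_member(PIXELS, "mean")
member_mode = dimension_member(PIXELS, "mode")
```

The same image lands in different members depending on the version chosen, which is why a model with static, uniquely computed dimensions struggles with multimedia descriptors.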

Extraction Model

When retrieving multimedia from a document warehouse, one can use the direct extraction model. Under this approach, one creates schemes, maps, or models that represent the stored data and interacts directly with the information stored in a warehouse, but this approach is time- and resource-consuming (Srivastava et al., 2013). When using a fact table and data marts, each mart is stored in a separate XML file, but each table can be used as a dimension for another table. In this case, Srivastava et al. (2013) recommend using dynamic indexing, which helps speed up data retrieval from these aggregated tables. However, in order to maximize the efficiency of this process, Srivastava et al. (2013) developed the concept of a Content Server, which will be detailed in the next section. Hence, extracting files under document management requires designing a system that accurately captures the nature and characteristics of the data stored within it.

Content Server

Different researchers have offered varied models for working with multimedia data. Srivastava et al. (2013) employed the concept of a Content Server for Document Warehousing to address the problems of efficiency when accessing multimedia data. The Content Server is a server-centered architecture that is divided into six subsystems: the ETL subsystem, the metadata extractor and manager subsystem, the data warehouse subsystem, the metadata indexer subsystem, the OLAP engine, and the front-end tools subsystem (Srivastava et al., 2013). The ETL subsystem is designed to help extract and transform information based on the needs of the organization, for example, by assigning different attributes or categories to the data points. The features of the multimedia files, including the name, length, format, and others, are stored as an XML file. Multimedia data is stored on the OLAP server, alongside the content server and knowledge server, from where it is extracted through indexing.

The metadata extraction subsystem is designed to analyze the multimedia in order to extract the needed metadata. Since multimedia is stored in varied formats and forms, each format requires its own extractor, which is the main complication of multimedia data extraction. Hence, the content server should have a separate extractor for texts, photos, videos, and each other form of multimedia used by an organization.
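The need for one extractor per format can be sketched as a small registry that dispatches on the file extension. This is an illustrative assumption about how such a subsystem might be organized, not the design from Srivastava et al. (2013); the function names and the registry are hypothetical. The PNG extractor relies on the published PNG layout, in which an 8-byte signature is followed by the IHDR chunk holding width and height as big-endian integers.

```python
from pathlib import PurePath

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_text_metadata(payload: bytes) -> dict:
    """Descriptors for plain text: character and word counts."""
    text = payload.decode("utf-8", errors="replace")
    return {"kind": "text", "characters": len(text), "words": len(text.split())}


def extract_png_metadata(payload: bytes) -> dict:
    """Descriptors for PNG images: width and height from the IHDR chunk."""
    if payload[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # Bytes 8-15 hold the chunk length and the 'IHDR' type tag.
    width = int.from_bytes(payload[16:20], "big")
    height = int.from_bytes(payload[20:24], "big")
    return {"kind": "image", "width": width, "height": height}


# Hypothetical registry: one extractor per supported format.
EXTRACTORS = {".txt": extract_text_metadata, ".png": extract_png_metadata}


def extract_metadata(name: str, payload: bytes) -> dict:
    """Pick the extractor that matches the file extension and run it."""
    suffix = PurePath(name).suffix.lower()
    extractor = EXTRACTORS.get(suffix)
    if extractor is None:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return {"name": name, **extractor(payload)}


# A hand-built PNG header (signature plus the start of IHDR), for demonstration.
FAKE_PNG = (
    PNG_SIGNATURE
    + (13).to_bytes(4, "big")
    + b"IHDR"
    + (640).to_bytes(4, "big")
    + (480).to_bytes(4, "big")
)

text_meta = extract_metadata("note.txt", b"late delivery again")
image_meta = extract_metadata("scan.PNG", FAKE_PNG)
```

Supporting a new format then means registering one more function, which keeps the format-specific complexity out of the rest of the pipeline.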

Aggarwal and Tiwari (2018) argue that only approximately 20% of business intelligence information can be extracted from numeric data, which is why the development and integration of document warehousing have become essential for contemporary businesses. They also advocate integrating OLAP processing techniques with document processing methods to enhance business analytics, and suggest using a Self-Organizing Map (SOM) and ECA rules to cluster textual data and manage the retrieval process. In this context, the Content Server approach designed and discussed by Srivastava et al. (2013) addresses the common issues that arise when working with multimedia data.

Not-only-SQL

One approach to managing the storage of multimedia data is Not-only-SQL (NoSQL). Bouaziz et al. (2019) discuss Not-only-SQL databases as an alternative to relational databases that emerged from the need to work with large quantities of multimedia data. One defining feature of these databases is that they are either schema-free or have schemas that are less structured than traditional ones. Hence, organizations working with Not-only-SQL databases should understand that each file within the warehouse may differ from the others, and that designing such a database involves extracting a schema from each individual file. Bouaziz et al. (2019) also propose creating a structure graph, which depicts the schemas extracted from each file, and then building a multidimensional model from it. Hence, the process described by Bouaziz et al. (2019) is similar to the steps that Srivastava et al. (2013) describe in their content server approach. In essence, the structuring of a document warehouse is rooted in the creation of a schema that contains the metadata and descriptions of each file, which may be unique to that file.
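Extracting a schema from each document and merging the results into a structure graph can be sketched as follows. This is a simplified illustration and does not reproduce the algorithm of Bouaziz et al. (2019): the JSON documents and field names are hypothetical, and arrays are not traversed.

```python
import json

# Hypothetical schema-free documents; each has a slightly different shape.
DOCUMENTS = [
    '{"patient": {"name": "A", "age": 40}, "scan": {"type": "xray"}}',
    '{"patient": {"name": "B"}, "notes": "follow-up",'
    ' "scan": {"type": "mri", "slices": 120}}',
]


def extract_schema(value, prefix=""):
    """Return the set of (path, type name) pairs found in one document."""
    schema = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            schema.add((path, type(child).__name__))
            schema |= extract_schema(child, path)
    return schema


def build_structure_graph(documents):
    """Merge the per-document schemas into parent -> children edges."""
    graph = {}
    for raw in documents:
        for path, _type_name in extract_schema(json.loads(raw)):
            parent, _, child = path.rpartition(".")
            graph.setdefault(parent or "<root>", set()).add(child)
    return graph


first_schema = extract_schema(json.loads(DOCUMENTS[0]))
graph = build_structure_graph(DOCUMENTS)
```

The merged graph contains fields that appear in only some documents, such as `notes` and `slices`, which is the information a designer would need before choosing facts and dimensions.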

Conclusion

In summary, this paper presents findings on the storage of multimedia files under Document Warehousing concepts. Unlike Data Warehousing, Document Warehousing is a challenging process, since building an efficient system for the storage and retrieval of multimedia data is complicated by the lack of algorithms that would help analyze and sort it. However, concepts such as the Content Server help address these issues. This report reviews and analyzes some of the common approaches to selecting, categorizing, storing, and retrieving multimedia data. Some of the systems and approaches used for data warehousing can be adapted for document warehousing. However, the main difficulty lies in the multidimensionality of unstructured data, which requires the developer of the warehouse to use modeling techniques that allow the creation of schemas or links between different types of files.

References

Aggarwal, M., & Tiwari, A. K. (2018). Approach for information retrieval by using self-organizing map and crisp set. In S. Agrawal, A. Devi, R. Wason, & P. Bansal (Eds.), Speech and language processing for human-machine communications (Advances in Intelligent Systems and Computing, Vol. 664, pp. 51-56). Springer.

Azabou, M., Banjar, A., & Feki, J. (2020). Enhancing the diamond document warehouse model. International Journal of Data Warehousing and Mining, 16(4), 1-25. doi: 10.4018/ijdwm.2020100101

Bouaziz, S., Nabli, A., & Gargouri, F. (2019). Design a data warehouse schema from document-oriented database. Procedia Computer Science, 159, 221-230. doi: 10.1016/j.procs.2019.09.177

Ferro, M., Fragoso, R., & Fidalgo, R. (2019). . IEEE Conference on Commerce and Enterprise Computing, CEC.

Srivastava, M., Singh, S. K., & Abbas, S. Q. (2013). An architecture for creation of multimedia data warehouse. International Journal of Engineering Science and Innovative Technology (IJESIT), 2(4), 309-315.
