Preface
The database heavyweights at The Database Column (among them Michael Stonebraker, who led the original Berkeley work behind PostgreSQL) recently published a commentary on MapReduce, a technology now at the height of its popularity, sparking intense discussion. I have found time to translate some of it here so we can study it together.
Translator's note: a Tanenbaum vs. Linus style exchange like this naturally produces very heated debate. Frankly, though, if the history of the Tanenbaum vs. Linus debate is any guide, Linux went on to absorb and apply, in its own way, more and more of the experience of OS researchers such as Tanenbaum, rather than turning its back on them; one hopes the MapReduce vs. DBMS debate will likewise enlighten those who come later, rather than divide them.
Original article: http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html
MapReduce: A major step backwards
Note: the authors are David J. DeWitt and Michael Stonebraker.
On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.
For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshman how to program using the MapReduce framework.
As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
- A giant step backward in the programming paradigm for large-scale data intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
- Missing most of the features that are routinely included in current DBMS
- Incompatible with all of the tools DBMS users have come to depend on
First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.
What is MapReduce?
The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.
The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.
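The map-plus-split flow described above can be sketched as follows. This is an illustrative sketch, not any real MapReduce implementation: the word-count map function, the choice of M, and the in-memory buckets standing in for on-disk files are all assumptions made for the example.

```python
# Sketch of the map phase: a user-supplied map function emits (key, data)
# records, and a deterministic split function assigns each record to one of
# M buckets by hashing its key.

M = 4  # number of buckets / reduce instances (illustrative)

def user_map(record):
    """User-written map: filter/transform one input record into (key, data) pairs."""
    for word in record.split():
        yield (word, 1)

def split(key, m=M):
    """Deterministic split function; a hash is typical, but any deterministic
    function of the key will do."""
    return hash(key) % m

def run_map(records):
    # The M lists stand in for the M output files a real map instance
    # would write to local disk.
    buckets = {j: [] for j in range(M)}
    for record in records:
        for key, data in user_map(record):
            buckets[split(key)].append((key, data))
    return buckets

buckets = run_map(["the quick fox", "the lazy dog"])
# Every record landed in the bucket its key hashes to, so all occurrences
# of the same key are in the same bucket.
assert all(split(k) == j for j, recs in buckets.items() for k, _ in recs)
```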
In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.
The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.
The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.
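Continuing the sketch, one reduce instance can be modeled as below: it groups its bucket's records on key and hands each group to a user-written reduce function. The summing reduce function is an illustrative choice, not anything prescribed by the framework.

```python
from collections import defaultdict

def user_reduce(key, values):
    """User-written reduce: an arbitrary computation over all values for one key."""
    return (key, sum(values))

def run_reduce(bucket_records):
    """One reduce instance R_j: group its input records on key (here by
    hashing into a dict; sorting would also work), then apply reduce."""
    groups = defaultdict(list)
    for key, data in bucket_records:   # records pulled from files F_{1,j} .. F_{N,j}
        groups[key].append(data)
    return [user_reduce(k, vs) for k, vs in sorted(groups.items())]

# One bucket's worth of map output, possibly produced by several map instances:
part = run_reduce([("dog", 1), ("the", 1), ("the", 1)])
assert part == [("dog", 1), ("the", 2)]
```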
To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
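The analogy can be made concrete with a tiny in-memory example (SQLite is used here purely for illustration; the table and values are invented): the GROUP BY attribute plays the role of the map key, and the aggregate function AVG is what a reduce instance would compute over each group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("shoe", 10000), ("shoe", 20000), ("toy", 30000)])

# Declarative form: the grouping (the 'map' role) and the aggregate
# (the 'reduce' role) are both expressed in one SQL statement.
rows = conn.execute(
    "SELECT dept, AVG(salary) FROM emp GROUP BY dept ORDER BY dept"
).fetchall()
assert rows == [("shoe", 15000.0), ("toy", 30000.0)]
```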
We now turn to the five concerns we have with this computing paradigm.
MapReduce is a step backwards in database access
As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.
- Schemas are good.
- Separation of the schema from the application is good.
- High-level access languages are good.
MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.
The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.
It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.
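The point about system catalogs can be seen in miniature with SQLite (chosen only as an illustration; the table is invented): the record structure is queryable from the catalog itself, with no need to read any application source code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary REAL)")

# The schema lives in the DBMS's catalog, not buried in application code;
# any user can query it. PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
cols = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(emp)")]
assert cols == [("name", "TEXT"), ("dept", "TEXT"), ("salary", "REAL")]
```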
During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:
- By stating what you want - rather than presenting an algorithm for how to get it (relational view)
- By presenting an algorithm for data access (Codasyl view)
The result is now ancient history, but the entire world saw the value of high-level languages and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.
MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.
Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}), different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.
MapReduce is a poor implementation
All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.
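SQLite's query planner (again, merely an illustration of the general point; the table and index names are invented) shows an optimizer switching from a brute-force scan to an index once one exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, salary INTEGER)")

def plan(sql):
    """Return the query plan's detail text; EXPLAIN QUERY PLAN rows end
    with a human-readable detail column."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT name FROM emp WHERE salary = 10000"
assert "SCAN" in plan(q)  # no index yet: the only option is a full scan

conn.execute("CREATE INDEX emp_salary ON emp (salary)")
assert "USING INDEX emp_salary" in plan(q)  # optimizer now picks the index
```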
MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.
One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.
In summary to this first point, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.
There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.
One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database System: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.
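A sketch of the skew problem: when one "hot" key accounts for most of the map output, its bucket dwarfs the others, and the job finishes only when the reduce instance holding that bucket does. The 90/10 distribution below is an illustrative assumption, not data from any real workload.

```python
from collections import Counter

M = 10  # number of reduce buckets (illustrative)

# Skewed map output: 900 of the 1000 records share one hot key.
records = [("hot", 1)] * 900 + [(f"key{i}", 1) for i in range(100)]

bucket_sizes = Counter(hash(k) % M for k, _ in records)
largest = max(bucket_sizes.values())
fair_share = len(records) / M  # 100 records per bucket if keys were uniform

# The reduce instance that receives "hot" processes at least 9x its fair
# share, so (all else equal) it finishes roughly 9x later than an average
# instance -- and the whole job waits for it.
assert largest >= 900
assert largest / fair_share >= 9
```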
There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.
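The file-count arithmetic in the paragraph above, spelled out (the 1,000 and 500 figures are the authors' own example):

```python
N = 1000   # map instances
M = 500    # reduce instances

local_files = N * M            # each map instance writes M bucket files
assert local_files == 500_000

files_pulled_per_reducer = N   # reducer R_j pulls F_{i,j} for i = 1..N
assert files_pulled_per_reducer == 1000

# Every materialized file is transferred exactly once in the pull-based
# shuffle, so total pulls equal total files written.
assert files_pulled_per_reducer * M == local_files
```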
Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.
MapReduce is not novel
The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.
Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.
While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.
MapReduce is missing features
All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:
- Bulk loader -- to transform input data in files into a desired format and load it into a DBMS
- Indexing -- as noted above
- Updates -- to change the data in the data base
- Transactions -- to support parallel update and recovery from failures during update
- Integrity constraints -- to help keep garbage out of the data base
- Referential integrity-- again, to help keep garbage out of the data base
- Views -- so the schema can change without having to rewrite the application program
In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.
MapReduce is incompatible with the DBMS tools
A modern SQL DBMS has available all of the following classes of tools:
- Report writers (e.g., Crystal reports) to prepare reports for human visualization
- Business intelligence tools (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses
- Data mining tools (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets
- Replication tools (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another
- Database design tools (e.g., Embarcadero) to assist the user in constructing a data base.
MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.
In Summary
It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.
We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally, we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig [10] project at Yahoo! Research is one such effort.
References
[1] "MapReduce: Simplified Data Processing on Large Clusters," Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.
[2] "The Gamma Database Machine Project," DeWitt, et al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.
[3] "Gamma - A High Performance Dataflow Database Machine," DeWitt, D., R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna, Proceedings of the 1986 VLDB Conference, 1986.
[4] "Prototyping Bubba, A Highly Parallel Database System," Boral, et al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.
[6] "Parallel Database System: The Future of High Performance Database Systems," David J. DeWitt and Jim Gray, CACM, Vol. 35, No. 6, June 1992.
[7] "Multiprocessor Hash-Based Join Algorithms," David J. DeWitt and Robert H. Gerber, Proceedings of the 1985 VLDB Conference, 1985.
[8] "The Case for Shared-Nothing," Michael Stonebraker, Data Engineering Bulletin, Vol. 9, No. 1, 1986.
[9] "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton, Proceedings of the 1995 SIGMOD Conference, 1995.
[10] "Pig", Chris Olston