Funding: the National Natural Science Foundation of China (Nos. 60533090 and 60603096); the National Hi-Tech Research and Development Program (863) of China (No. 2006AA010107); the Key Technology R&D Program of China (No. 2006BAH02A13-4); the Program for Changjiang Scholars and Innovative Research Team in University of China (No. IRT0652); and the Cultivation Fund of the Key Scientific and Technical Innovation Project of MOE, China (No. 706033)
Abstract: Recently, a new clustering algorithm called 'affinity propagation' (AP) has been proposed, which efficiently clusters sparsely related data by passing messages between data points. In many cases, however, we want to cluster large-scale data whose similarities are not sparse. This paper presents two variants of AP for grouping large-scale data with a dense similarity matrix: a local approach, partition affinity propagation (PAP), and a global approach, landmark affinity propagation (LAP). PAP passes messages within subsets of the data first and then merges the subsets after an initial number of iteration steps; it can effectively reduce the number of clustering iterations. LAP passes messages between landmark data points first and then clusters the non-landmark points; it is a global approximation method for speeding up clustering on large data. Experiments conducted on many datasets, including random data points, manifold subspaces, and images of faces and Chinese calligraphy, demonstrate that the two approaches are feasible and practical.
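To make the landmark idea concrete, here is a minimal sketch using scikit-learn's AffinityPropagation as a stand-in: AP is run only on a sampled landmark subset, and every remaining point is assigned to its nearest exemplar. The sampling step and assignment rule are illustrative assumptions; the paper's actual LAP message-passing scheme may differ.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def landmark_ap(X, n_landmarks=200, seed=0):
    """Approximate clustering in the spirit of landmark AP:
    run AP on a landmark subset, then assign the rest."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_landmarks, len(X)), replace=False)
    ap = AffinityPropagation(random_state=seed).fit(X[idx])
    exemplars = X[idx][ap.cluster_centers_indices_]
    # Assign every point (landmark or not) to its nearest exemplar.
    d = ((X[:, None, :] - exemplars[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

labels = landmark_ap(np.random.rand(5000, 2))
```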
Funding: supported by the National Natural Science Foundation of China, "Research on Cross-sector Competition Effect and Regulatory Policy of Digital Platforms Based on Inter-platform Network Externalities" (Grant No. 72103085).
Abstract: Data is a key asset for digital platforms, and mergers and acquisitions (M&As) are an important way for platform enterprises to acquire it. The types of data obtained from intra-industry and cross-sector M&As differ, as does the extent to which the data interact within or between platforms. The impact of such data on corporate market performance is an important question when selecting strategies for digital platform M&As. Focusing on advertising-driven platforms, we developed a two-stage Hotelling game model to compare the market-performance effects of intra-industry and cross-sector M&As for digital platforms. We carried out an empirical test using data from advertising-driven digital platforms between 2009 and 2021, as well as a case study of Baidu's M&A activities. We found that intra-industry M&As, driven by "data economies of scale," and cross-sector M&As, driven by "data economies of scope," both benefit the market performance of platform enterprises. Intra-industry M&As have a more significant positive effect because the same types of data are easier to integrate and develop a "network effect of data scale." From a data-factor perspective, this paper reveals the economic logic by which different types of M&As influence the market performance of digital platforms and offers policy recommendations for digital platforms selecting M&A strategies based on data scale, data scope, and the network effect of data.
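For readers unfamiliar with the modeling framework, a generic single-period Hotelling setup conveys the basic mechanism (the paper's two-stage, data-augmented specification is more elaborate). With two platforms located at the ends of a unit line, transport cost t, prices p1 and p2, and stand-alone values v1 and v2 (which an M&A can raise through data integration), a consumer at location x has utility, and the market splits at the indifferent consumer:

```latex
u_i(x) = v_i - p_i - t\,\lvert x - x_i\rvert, \qquad x_1 = 0,\; x_2 = 1,
\qquad
x^{*} = \frac{1}{2} + \frac{(v_1 - p_1) - (v_2 - p_2)}{2t},
\qquad
D_1 = x^{*},\quad D_2 = 1 - x^{*}.
```

Raising v1 through data integration shifts x* rightward, which is the channel through which M&A-acquired data improves market performance in models of this family.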
Abstract: Social media data have created a paradigm shift in assessing situational awareness during natural disasters and emergencies such as wildfires, hurricanes, and tropical storms. Twitter, as an emerging data source, is an effective and innovative digital platform for observing trends from the perspective of social media users who are direct or indirect witnesses of a calamitous event. This paper collects and analyzes Twitter data related to a recent wildfire in California to perform a trend analysis by identifying firsthand and credible information from Twitter users. It investigates tweets on the wildfire and classifies them into two types of witnesses: 1) direct witnesses and 2) indirect witnesses. The collected and analyzed information can be useful to law enforcement agencies and humanitarian organizations for communicating and verifying situational awareness during wildfire hazards. Trend analysis is an aggregated approach that includes sentiment analysis and topic modeling performed through domain-expert manual annotation and machine learning. It ultimately builds a fine-grained analysis to assess evacuation routes and provide valuable information to firsthand emergency responders.
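A minimal supervised sketch of the witness-type classification is shown below. The example tweets and the TF-IDF/logistic-regression pipeline are illustrative assumptions; the paper combines expert annotation with machine learning, and its exact features and model are not specified in the abstract.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled tweets: 1 = direct witness, 0 = indirect witness.
tweets = ["Flames visible from my backyard, evacuating now",
          "Praying for everyone affected by the California wildfire",
          "Smoke everywhere on my street, can barely breathe",
          "News says the fire has spread to 10,000 acres"]
labels = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(tweets, labels)
print(clf.predict(["I can see the fire from my window"]))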
Abstract: The small-scale drilling technique can be a fast and reliable method for estimating rock strength parameters, but it requires linking operational drilling parameters to the strength properties of rock. Parameters such as bit geometry, bit movement, contact frictions, and the crushed zone affect the estimates. An analytical model that accounts for operational drilling data and these effective parameters can serve this purpose. In this research, an analytical model was developed based on the limit equilibrium of forces on a T-shaped drag bit, considering effective parameters such as bit geometry, the crushed zone, and contact frictions in the drilling process. Based on the model, a method was used to estimate rock strength parameters such as cohesion, internal friction angle, and uniaxial compressive strength for different rock types from operational drilling data. Drilling tests were conducted with a portable, powerful drilling machine developed for this work. The strength properties of different rock types obtained from the drilling experiments with the proposed model agree well with the results of standard tests. The experimental results show that the contact friction between the cutting face and the rock is close to that between the bit's end wearing face and the rock, owing to the same bit material. In this case, the strength parameters, especially internal friction angle and cohesion, can be estimated using only blunt-bit drilling data, and bit bluntness does not affect the estimates.
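Once cohesion c and internal friction angle φ have been estimated from the drilling data, the standard Mohr-Coulomb relation ties them to the uniaxial compressive strength. The helper below encodes only that textbook relation, not the paper's full bit model:

```python
import math

def ucs_from_mohr_coulomb(cohesion_mpa: float, friction_angle_deg: float) -> float:
    """Uniaxial compressive strength from the Mohr-Coulomb criterion:
    UCS = 2*c*cos(phi) / (1 - sin(phi))."""
    phi = math.radians(friction_angle_deg)
    return 2.0 * cohesion_mpa * math.cos(phi) / (1.0 - math.sin(phi))

print(ucs_from_mohr_coulomb(10.0, 35.0))  # ~38.4 MPa for c=10 MPa, phi=35 deg
```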
Abstract: Six national-scale, or near national-scale, geochemical data sets for soils or stream sediments exist for the United States. The earliest of these, here termed the 'Shacklette' data set, was generated by a U.S. Geological Survey (USGS) project conducted from 1961 to 1975. This project used soil collected from a depth of about 20 cm as the sampling medium at 1323 sites throughout the conterminous U.S. The National Uranium Resource Evaluation Hydrogeochemical and Stream Sediment Reconnaissance (NURE-HSSR) Program of the U.S. Department of Energy was conducted from 1975 to 1984 and collected stream sediments, lake sediments, or soils at more than 378,000 sites in the conterminous U.S. and Alaska; the sampled area represented about 65% of the nation. The Natural Resources Conservation Service (NRCS), from 1978 to 1982, collected samples from multiple soil horizons at sites within the major crop-growing regions of the conterminous U.S.; this data set contains analyses of more than 3000 samples. The National Geochemical Survey, a USGS project conducted from 1997 to 2009, used a subset of the NURE-HSSR archival samples as its starting point and then collected primarily stream sediments, with occasional soils, in the parts of the U.S. not covered by the NURE-HSSR Program; this data set contains chemical analyses for more than 70,000 samples. The USGS, in collaboration with the Mexican Geological Survey and the Geological Survey of Canada, initiated soil sampling for the North American Soil Geochemical Landscapes Project in 2007; sampling of three horizons or depths at more than 4800 sites in the U.S. was completed in 2010, and chemical analyses are currently ongoing. The NRCS initiated a project in the 1990s to analyze the various soil horizons from selected pedons throughout the U.S.; this data set currently contains data from more than 1400 sites. This paper (1) discusses each data set in terms of its purpose, sample collection protocols, and analytical methods; and (2) evaluates each data set in terms of its appropriateness as a national-scale geochemical database and its usefulness for national-scale geochemical mapping.
Abstract: This work presents a method that integrates both emerging and mature data sources to estimate operational travel demand at fine spatial and temporal resolutions. By analyzing individuals' mobility patterns revealed by their mobile phones, researchers and practitioners can now derive the largest trip samples available for a region. Because of the ubiquity of mobile phones, the extensive coverage of telecommunication services, and high penetration rates, travel demand can be studied continuously at fine spatial and temporal resolutions. The derived sample (seed) trip matrices are coupled with surveyed commute-flow data and prevalent travel demand modeling techniques to estimate total regional travel demand in the form of origin-destination (OD) matrices. The methodology has been evaluated in a series of real-world transportation planning studies and has proved its potential in application areas such as dynamic traffic assignment modeling, integrated corridor management, and online traffic simulation.
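One prevalent technique for scaling a phone-derived seed matrix to surveyed totals is iterative proportional fitting (the Furness method). The sketch below assumes the seed OD matrix and the target origin/destination totals are given; it is illustrative of the general scaling step, not necessarily the paper's exact procedure.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, iters=50):
    """Scale a seed OD matrix so its row/column sums match target totals."""
    od = seed.astype(float).copy()
    for _ in range(iters):
        od *= (row_targets / od.sum(axis=1))[:, None]  # balance origin totals
        od *= (col_targets / od.sum(axis=0))[None, :]  # balance destination totals
    return od

seed = np.array([[10, 5], [2, 8]])
print(ipf(seed, row_targets=np.array([120, 80]), col_targets=np.array([90, 110])))
```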
Abstract: Owing to the conflict between the huge volume of map data and limited network bandwidth, rapid transmission of vector map data over the Internet has become a bottleneck for spatial data delivery in web-based environments. This paper proposes an approach to organizing and transmitting multi-scale vector river network data progressively over the Internet. The approach considers two levels of importance, i.e., the importance of river branches and the importance of the points belonging to each branch, and forms data packages accordingly. Our experiments show that the proposed approach can reduce the original data volume by 90% while preserving the river structure well.
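A minimal sketch of the two-level importance ordering follows. The importance measures here are placeholders; the paper's actual metrics for ranking branches and points are more refined.

```python
def build_packages(branches, package_size=100):
    """branches: list of dicts {'importance': float, 'points': [(x, y, pt_importance), ...]}.
    Orders river branches by importance and, within each branch, its points by
    importance, then slices the stream into fixed-size transmission packages."""
    stream = []
    for branch in sorted(branches, key=lambda b: -b["importance"]):
        stream.extend(sorted(branch["points"], key=lambda p: -p[2]))
    return [stream[i:i + package_size] for i in range(0, len(stream), package_size)]
```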
Funding: Supported by a Project of the National Natural Science Foundation (No. 41874134)
Abstract: Processing large-scale 3-D gravity data is an important topic in geophysics. Many existing inversion methods lack the capacity to process massive data and to be applied in practice. This study applies GPU parallel processing technology to the focusing inversion method, aiming to improve inversion accuracy while speeding up computation and reducing memory consumption, thus obtaining fast and reliable inversion results for large, complex models. In this paper, equivalent storage of the geometric trellis is used to calculate the sensitivity matrix, and the inversion is based on GPU parallel computing technology. The parallel program, optimized by reducing data transfer, access restrictions, and instruction restrictions as well as by latency hiding, greatly reduces memory usage, speeds up computation, and makes fast inversion of large models possible. Comparing the computing speed of the traditional single-threaded CPU method with that of CUDA-based GPU parallel technology verifies the excellent acceleration performance of GPU parallel computing, which provides ideas for the practical application of theoretical inversion methods otherwise restricted by computing speed and computer memory. Model tests verify that the focusing inversion method can overcome the problems of severe skin effect and ambiguity of geological body boundaries. Moreover, increasing the number of model cells and inversion data can more clearly depict the boundary position of an anomalous body and delineate its specific shape.
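The dominant cost in each inversion step is applying the sensitivity matrix. A minimal sketch of moving that product to the GPU with CuPy is shown below; this is only a stand-in to illustrate the data-parallel idea, since the paper uses custom CUDA kernels with its own equivalent-storage and memory optimizations.

```python
import numpy as np
try:
    import cupy as xp  # GPU arrays; fall back to NumPy if CuPy is unavailable
except ImportError:
    xp = np

def sensitivity_matvec(G, m):
    """Compute the forward response d = G @ m on the GPU (or CPU fallback)."""
    G_dev, m_dev = xp.asarray(G), xp.asarray(m)
    d_dev = G_dev @ m_dev
    return d_dev.get() if xp is not np else d_dev  # copy back to host memory
```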
Funding: Supported by the National Major Scientific and Technological Special Project for "Significant New Drugs Development" (No. 2018ZX09201008) and the Special Fund Project for Information Development of the Shanghai Municipal Commission of Economy and Information (No. 201701013)
Abstract: Regional healthcare platforms collect clinical data from hospitals in specific areas for the purpose of healthcare management. Reusing these data for clinical research is a common requirement; however, we face challenges such as inconsistent terminology in electronic health records (EHR) and the complexity of data quality and data formats on a regional healthcare platform. In this paper, we propose a methodology and process for constructing large-scale cohorts, which form the basis of causality and comparative-effectiveness relationships in epidemiology. We first constructed a Chinese terminology knowledge graph to deal with the diversity of vocabularies on the regional platform. Second, we built special disease case repositories (e.g., a heart failure repository) that use the graph to search for related patients and to normalize the data. To meet the requirements of a clinical study exploring the effect of statin use on 180-day readmission in patients with heart failure, we built a large-scale retrospective cohort of 29,647 heart failure cases from the heart failure repository. After propensity score matching, a study group (n=6346) and a control group (n=6346) with parallel clinical characteristics were obtained. Logistic regression analysis showed that taking statins was negatively correlated with 180-day readmission in heart failure patients. This paper presents the workflow and an application example of big data mining based on regional EHR data.
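A minimal sketch of the propensity-score-matching step follows; the covariate columns ("age", "ef") are hypothetical, and the study's actual covariates, matching caliper, and replacement policy are not listed in the abstract.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def ps_match(df, treat_col, covariates):
    """1:1 nearest-neighbor matching on the estimated propensity score."""
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df[treat_col])
    df = df.assign(ps=model.predict_proba(df[covariates])[:, 1])
    treated = df[df[treat_col] == 1]
    control = df[df[treat_col] == 0]
    nn = NearestNeighbors(n_neighbors=1).fit(control[["ps"]])
    _, idx = nn.kneighbors(treated[["ps"]])  # nearest control for each treated
    return treated, control.iloc[idx.ravel()]

# Toy data with hypothetical covariates (age, ejection fraction).
rng = np.random.default_rng(0)
df = pd.DataFrame({"statin": rng.integers(0, 2, 500),
                   "age": rng.normal(70, 10, 500),
                   "ef": rng.normal(40, 8, 500)})
study, ctrl = ps_match(df, "statin", ["age", "ef"])
```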
Funding: Supported by the National Basic Research Program of China (973 Program) (2009CB320601), the National Natural Science Foundation of China (60774048, 60821063), the Program for Cheung Kong Scholars, and the Research Fund for the Doctoral Program of China Higher Education (20070145015)
Abstract: This paper studies the problem of sampled-data reliable H∞ control for uncertain continuous-time fuzzy large-scale systems with time-varying delays. First, the fuzzy hyperbolic model (FHM) is used to model certain complex large-scale systems. Then, based on the Lyapunov direct method and the decentralized control theory of large-scale systems, linear matrix inequality (LMI)-based conditions are derived to guarantee H∞ performance not only when all control components operate well, but also in the presence of possible actuator failures. Moreover, the exact actuator failure parameters are not required; only the lower and upper bounds of the failure parameters are needed. The conditions depend on the upper bound of the time delays but not on the derivatives of the time-varying delays, so the obtained results are less conservative. Finally, two examples are provided to illustrate the design procedure and its effectiveness.
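For context, the H∞ performance requirement being certified has the standard form below; this illustrates the general setting only, as the paper's delay-dependent LMIs are specific to the FHM structure. Under zero initial conditions, a controller achieves disturbance-attenuation level γ if, for all disturbances w in L2[0, ∞) with controlled output z,

```latex
\int_0^{\infty} z^{T}(t)\,z(t)\,dt \;\le\; \gamma^{2} \int_0^{\infty} w^{T}(t)\,w(t)\,dt,
```

which is typically established by finding matrices P = Pᵀ > 0 (plus delay-related multipliers) satisfying a set of LMIs derived from a Lyapunov-Krasovskii functional.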