Document distribution algorithm for load balancing on an extensible Web server architecture
Access latency and load balancing are the two main issues in the design of clustered Web server architectures for achieving high performance. We propose a novel document distribution algorithm for load balancing on a cluster of distributed Web servers. We group Web pages that are likely to be accessed during a request session into a migrating unit, which is used as the basic unit of document placement. A modified binning algorithm is developed to distribute the migrating units among the Web servers to achieve load balancing. We also present a redirection mechanism, which makes use of a migrating unit's property, to reduce the cost of request redirections. The distribution of Web documents is recomputed periodically to adapt to changes in client request patterns and system configuration. Simulation results show that our solution can reduce the amount of request redirection and document migration, and that it can distribute workload properly among Web servers.
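To make the placement idea concrete, here is a minimal sketch of spreading migrating units over servers. It uses a plain greedy heuristic (heaviest unit first, onto the currently least-loaded server), which is an assumption for illustration and not the modified binning algorithm of the paper. The unit names, load figures, and function names are hypothetical.

```python
import heapq


def distribute_units(unit_loads, num_servers):
    """Greedily assign each migrating unit to the least-loaded server.

    unit_loads maps a (hypothetical) unit name to its expected request load.
    Returns a dict mapping unit name -> server index.
    """
    # Min-heap of (accumulated_load, server_index).
    heap = [(0.0, s) for s in range(num_servers)]
    heapq.heapify(heap)
    placement = {}
    # Place the heaviest units first; ties are broken by name for determinism.
    for unit, load in sorted(unit_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, server = heapq.heappop(heap)
        placement[unit] = server
        heapq.heappush(heap, (total + load, server))
    return placement


def server_loads(unit_loads, placement, num_servers):
    """Sum the load that a placement puts on each server."""
    loads = [0.0] * num_servers
    for unit, server in placement.items():
        loads[server] += unit_loads[unit]
    return loads


if __name__ == "__main__":
    units = {"news": 50, "sports": 30, "docs": 20, "blog": 20, "faq": 10}
    placement = distribute_units(units, 2)
    print(placement)
    print(server_loads(units, placement, 2))
```

Because a whole unit is placed on one server, pages that tend to be requested in the same session stay together, which is the property the abstract relies on to limit redirections.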
Push-Pull Messaging: a high-performance communication mechanism for commodity SMP clusters
Push-Pull Messaging is a novel messaging mechanism for high-speed interprocess communication in a cluster of symmetric multiprocessor (SMP) machines. This messaging mechanism exploits the parallelism in SMP nodes by allowing the communication stages of a messaging event to execute on different processors to achieve maximum performance. Push-Pull Messaging further improves communication performance by employing three optimizing techniques in our design: (1) Cross-Space Zero Buffer provides a unified buffer management mechanism to achieve copy-less communication for data transfer among processes within an SMP node. (2) Address Translation Overhead Masking removes the address translation overhead from the critical path in internode communication. (3) Push-and-Acknowledge Overlapping overlaps the push and acknowledge phases to hide the acknowledge latency. Overall, Push-Pull Messaging effectively utilizes the system resources and improves the communication speed. It has been implemented to support high-speed communication for connecting quad Pentium Pro SMPs with 100 Mbit/s Fast Ethernet.
Contention-Free Complete Exchange Algorithms on Clusters
To construct a large commodity cluster, a hierarchical network is generally adopted for connecting the host machines, where a Gigabit backbone switch connects a few commodity switches with uplinks to achieve scaled bisectional bandwidth. This type of interconnection usually results in link contention, with congestion developing at the uplink ports. Moreover, the non-deterministic delays in scheduling communication events in clusters accelerate the build-up of congestion among these uplink ports, which leads to severe packet drops and hinders the overall performance. In this paper, we focus on the practical design of a high-speed complete exchange algorithm on a commodity cluster interconnected by a hierarchical Ethernet-based network. By exploiting some architectural characteristics of the interconnection in optimizing the performance of a complete exchange algorithm, we introduce a congestion control mechanism, global windowing, that monitors and regulates the traffic load, together with a permutation scheme, the reorder scheme, that effectively alleviates the congestion problem. We evaluate our algorithm and compare its performance with other algorithms in a PC cluster connected by various types of switches, including Gigabit Ethernet, input-buffered and shared-memory Fast Ethernet switches.
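For readers unfamiliar with complete exchange, the sketch below builds a textbook shifted-permutation schedule: in step k, node i sends to node (i + k) mod n, so every node sends and receives exactly one message per step. This is an illustrative assumption, not the global windowing or reorder scheme described above, and it only avoids contention at the endpoints; it does not model uplink ports in a hierarchical network. The function names are hypothetical.

```python
def shift_schedule(num_nodes):
    """Return the steps of a shifted-permutation complete exchange.

    Each step is a list of (sender, receiver) pairs; step k pairs
    sender i with receiver (i + k) % num_nodes.
    """
    return [
        [(i, (i + k) % num_nodes) for i in range(num_nodes)]
        for k in range(1, num_nodes)
    ]


def is_permutation_step(step, num_nodes):
    """Check that every node sends once and receives once in this step."""
    senders = sorted(s for s, _ in step)
    receivers = sorted(r for _, r in step)
    return senders == receivers == list(range(num_nodes))


if __name__ == "__main__":
    for k, step in enumerate(shift_schedule(4), start=1):
        print(f"step {k}: {step}")
```

With n nodes the schedule has n - 1 steps and covers every ordered pair of distinct nodes exactly once.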
Efficient reliable broadcast for commodity clusters
High-speed collective communication is the key to achieving high performance in parallel computing. In the past, collective operations were usually implemented using unicast operations. We propose a new architecture, EQA (Enhanced Queue Architecture), for implementing high-speed collective operations in a cluster. By combining EQA with the hardware broadcast facility in network switches, an efficient reliable broadcast operation is implemented in a DP-SMP communication subsystem. With EQA, the computation, memory and network resources can be utilized efficiently. We evaluated the performance of the broadcast operation in a commodity cluster with Fast Ethernet connections. We found that the hardware-based broadcast from DP-SMP with EQA outperforms the software-based broadcast operation. The use of EQA in the broadcast operation could reduce memory consumption by almost 40%. DP-SMP with EQA has proven to be an efficient communication mechanism for coupling commodity clusters.
GPS calibrated ad-hoc localization for geosocial networking
LNCS v. 6406 is the conference proceedings of UIC 2010. Cost-effective localization for large-scale geosocial networking services is a challenging issue in urban environments. This paper studies an ad-hoc localization technique which takes advantage of location information exchanged over short range to calibrate the location of mobile users carrying non-GPS mobile phones. We demonstrate by simulation that a small percentage of GPS-enabled mobile phones can greatly help the localization of other non-GPS pedestrians in the urban environment. Based on the proposed localization technique, we implement a location-aware social networking tool called Mobile Twitter, similar to the microblogging service of Twitter, for fast propagation of social events happening in the surroundings. Evaluation shows that our localization algorithm can achieve better accuracy of location estimation and wider coverage compared with the Amorphous algorithm and the Monte Carlo Localization (MCL) method. Moreover, we show that the Mobile Twitter implemented on an Android mobile phone is power-efficient in real-life usage scenarios. © 2010 Springer-Verlag. The 7th International Conference on Ubiquitous Intelligence and Computing (UIC) 2010, Xi'an, China, 26-29 October 2010. In Lecture Notes in Computer Science, 2010, v. 6406, p. 52-6
Scheduling parallel machines with inclusive processing set restrictions and job release times
2009-2010 > Academic research: refereed > Publication in refereed journal
Conditional Image-Text Embedding Networks
This paper presents an approach for grounding phrases in images which jointly
learns multiple text-conditioned embeddings in a single end-to-end model. In
order to differentiate text phrases into semantically distinct subspaces, we
propose a concept weight branch that automatically assigns phrases to
embeddings, whereas prior works predefine such assignments. Our proposed
solution simplifies the representation requirements for individual embeddings
and allows the underrepresented concepts to take advantage of the shared
representations before feeding them into concept-specific layers. Comprehensive
experiments verify the effectiveness of our approach across three phrase
grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, where
we obtain improvements of 4%, 3%, and 4%, respectively, in grounding performance
over a strong region-phrase embedding baseline. Comment: ECCV 2018 accepted paper
Cache affinity optimization techniques for scaling software transactional memory systems on multi-CMP architectures
Software transactional memory (STM) enhances both ease of use and concurrency, and is considered one of the next-generation paradigms for parallel programming. Application programs may see hotspots where data conflicts are intensive and seriously degrade performance, so advanced STM systems employ dynamic concurrency control techniques that curb the conflict rate by throttling the rate at which transactions are spawned. High-end computers may have two or more multicore processors, so that data sharing among cores goes through a non-uniform cache memory hierarchy. This poses challenges to concurrency control designs, as improper metadata placement and sharing will introduce scalability issues to the system. Poor thread-to-core mappings that induce excessive cache invalidation are also detrimental to the overall performance. In this paper, we share our experience in designing and implementing a new dynamic concurrency controller for Tiny STM, which helps keep the system concurrency at a near-optimal level. By decoupling unfavourable metadata sharing, our controller design avoids costly inter-processor communications. It also features an affinity-aware thread migration technique that fine-tunes thread placements by observing inter-thread transactional conflicts. We evaluate our implementation using the STAMP benchmark suite and show that the controller can bring around 21% average speedup over the baseline execution. © 2015 IEEE.
High performance communication subsystem for clustering standard high-volume servers using Gigabit Ethernet
This paper presents an efficient communication subsystem, DP-II, for clustering standard high-volume (SHV) servers using Gigabit Ethernet. DP-II employs several lightweight messaging mechanisms to achieve low-latency and high-bandwidth communication. Tests show an 18.32 µs single-trip latency and 72.8 MB/s bandwidth on a Gigabit Ethernet network connecting two Dell PowerEdge 6300 Quad Xeon SMP servers running Linux. To improve the programmability of the DP-II communication subsystem, its development was based on a concise yet powerful abstract communication model, the Directed Point Model, which can be conveniently used to depict the inter-process communication pattern of a parallel task in the cluster environment. In addition, the API of DP-II preserves the syntax and semantics of traditional UNIX I/O operations, which makes it easy to use.
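To put the two reported figures in perspective, the sketch below plugs them into a simple linear cost model, T(n) = latency + n / bandwidth. The model itself is an assumption for illustration (the abstract does not state one), as is reading MB as 10^6 bytes; the function names are hypothetical.

```python
LATENCY_US = 18.32          # single-trip latency reported in the abstract
BANDWIDTH_MB_PER_S = 72.8   # bandwidth reported in the abstract

# With MB assumed to mean 10**6 bytes, 1 MB/s equals 1 byte per microsecond.
BYTES_PER_US = BANDWIDTH_MB_PER_S


def transfer_time_us(num_bytes):
    """Estimated one-way time in microseconds under the linear model."""
    return LATENCY_US + num_bytes / BYTES_PER_US


def half_performance_size_bytes():
    """Message size at which latency and transmission time are equal."""
    return LATENCY_US * BYTES_PER_US


if __name__ == "__main__":
    for size in (64, 1024, 65536):
        print(size, "bytes:", round(transfer_time_us(size), 2), "us")
    print("half-performance size:", round(half_performance_size_bytes(), 1), "bytes")
```

Under these assumptions, messages well below roughly 1.3 KB are dominated by latency, and larger ones by bandwidth.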
